Psychometrika


A JOURNAL DEVOTED TO THE DEVEL- 
OPMENT OF PSYCHOLOGY AS A 
QUANTITATIVE RATIONAL SCIENCE 


















































THE PSYCHOMETRIC SOCIETY - ORGANIZED IN 1935








VOLUME 15
NUMBER 3
SEPTEMBER
1950











PSYCHOMETRIKA, the official journal of the Psychometric Society, is devoted to
the development of psychology as a quantitative rational science. Issued four
times a year, on March 15, June 15, September 15, and December 15.


SEPTEMBER 1950 VOLUME 15, NUMBER 3 


Printed for the Psychometric Society at 28 West Colorado Avenue, Colorado 
Springs, Colorado. Entered as second class matter, September 17, 1940, at the 
Post Office of Colorado Springs, Colorado, under the act of March 3, 1879.
Editorial Office, Department of Psychology, The University of North Carolina,
Chapel Hill, North Carolina.


Subscription Price: The regular subscription rate is $10.00 per volume. The
subscriber receives each issue as it comes out, and a second complete set for binding
at the end of the year. All annual subscriptions start with the March issue and
cover the calendar year. All back issues are available. The price is $1.25 per
issue or $5.00 per volume (one set only). Members of the Psychometric Society
pay annual dues of $5.00, of which $4.50 is in payment of a subscription to
Psychometrika. Student members of the Psychometric Society pay annual dues
of $3.00, of which $2.70 is in payment for the journal.


Application for membership and student membership in the Psychometric Society,
together with a check for dues for the calendar year in which application is 
made, should be sent to 


T. GAYLORD ANDREWS, Chairman of the Membership Committee 
Department of Psychology, The University of Maryland, College Park, 
Maryland 


Payments: All bills and orders are payable in advance. Checks covering
membership dues should be made payable to the Psychometric Society. Checks
covering regular subscription to Psychometrika and back issue orders should be made
payable to the Psychometric Corporation. All checks, notices of change of
address, and business communications should be addressed to


ROBERT L. THORNDIKE, Treasurer, Psychometric Society and Psychometric
Corporation
Teachers College, Columbia University
New York 27, New York


Articles on the following subjects are published in Psychometrika: 


(1) the development of quantitative rationale for the solution of
psychological problems;
(2) general theoretical articles on quantitative methodology in the social and
biological sciences;
(3) new mathematical and statistical techniques for the evaluation of
psychological data;
(4) aids in the application of statistical techniques, such as nomographs,
tables, work-sheet layouts, forms, and apparatus;
(5) critiques or reviews of significant studies involving the use of
quantitative techniques.

The emphasis is to be placed on articles of type (1), in so far as articles of this
type are available.
In the selection of the articles to be printed in Psychometrika, an effort is made 
to obtain objectivity of choice. All manuscripts are received by one person, who 
(Continued on the back inside cover page)









































Psychometrika 





CONTENTS 


THE PROBLEM OF CLASSIFICATION OF PERSONNEL
ROBERT L. THORNDIKE

CHANGES IN COMMON-FACTOR LOADINGS AS TESTS
ARE ALTERED HOMOGENEOUSLY IN LENGTH
J. P. GUILFORD AND WILLIAM B. MICHAEL

A TEST OF THE EQUALITY OF STANDARD ERRORS
OF MEASUREMENT
BERT F. GREEN, JR.

THE RELIABILITY OF SPEEDED TESTS
HAROLD GULLIKSEN

THE DISCRIMINATION OF TWO RACIAL SAMPLES
PAUL HORST AND STEVENSON SMITH

AN EXPERIMENTAL STUDY OF THE EFFECTS ON
ITEM-ANALYSIS DATA OF CHANGING ITEM
PLACEMENT AND TEST TIME LIMIT
WILLIAM G. MOLLENKOPF

A PROPOSED METHOD FOR ABSOLUTE SCALING
ANDREW L. COMREY

PALMER O. JOHNSON. Statistical Methods in Research
A Review by WILLIAM B. MICHAEL

J. P. GUILFORD AND WILLIAM B. MICHAEL. The Prediction
of Categories from Measurements
A Review by HUBERT E. BROGDEN

BOOKS RECEIVED








VOLUME FIFTEEN SEPTEMBER 1950 NUMBER THREE 








NOTICE 


The Educational Testing Service is offering for 1951-52 
its fourth series of research fellowships in psychometrics lead- 
ing to the Ph.D. degree at Princeton University. Open to men 
who are acceptable to the Graduate School of the University, 
the two fellowships each carry a stipend of $2,375 a year and 
are normally renewable. 

Fellows will be engaged in part-time research in the gen- 
eral area of psychological measurement at the offices of the 
Educational Testing Service and will, in addition, carry a 
normal program of studies in the Graduate School. Compe- 
tence in mathematics and psychology is a prerequisite for ob- 
taining these fellowships. Information and application blanks 
may be obtained from: Director of Psychometric Fellowship 
Program, Educational Testing Service, 20 Nassau Street, 
Princeton, New Jersey. 








PSYCHOMETRIKA—VOL. 15, NO. 3 
SEPTEMBER, 1950 


THE PROBLEM OF CLASSIFICATION OF PERSONNEL* 


ROBERT L. THORNDIKE 
TEACHERS COLLEGE, COLUMBIA UNIVERSITY 


The personnel classification problem arises in its pure form 
when all job applicants must be used, being divided among a num- 
ber of job categories. The use of tests for classification involves 
problems of two types: (1) problems concerning the design, choice,
and weighting of tests into a battery, and (2) problems of estab- 
lishing the optimum administrative procedure of using test results 
for assignment. A consideration of the first problem emphasizes 
the desirability of using simple, factorially pure tests which may 
be expected to have a wide range of validities for different job 
categories. In the use of test results for assignment, an initial 
problem is that of expressing predictions of success in different jobs 
in comparable score units. These units should take account of pre- 
dictor validity and of job importance. Procedures are described for 
handling assignment either in terms of daily quotas or in terms
of a stable predicted yield. 


The past decade, and particularly the war years, have witnessed 
a great concern about the classification of personnel and a vast ex- 
penditure of effort presumably directed towards this end. In all 
branches of the military establishment were found “general classifi- 
cation” tests or test batteries planned to serve a classification func- 
tion. Since the war the number of published test batteries designed 
for differential prediction has rapidly multiplied. It seems timely, 
therefore, to look into the problem of the classification of personnel 
to see what the concept means, what issues it raises with respect to 
the theory of measurement, and what problems it presents with re- 
spect to the practical operation of a testing program. 

It must be indicated that much of the present discussion repre- 
sents an examination of concepts, a raising of questions, and an 
offering of intuitive suggestions, rather than a presentation of mathe- 
matically established answers. The defining of questions represents 
a first step in answering them. It is hoped that clarification of the 
problems and issues in the following pages may stimulate others to
solve them.

*Address of the President of the Division on Evaluation and Measurement
of the American Psychological Association, delivered at Denver, Colorado,
September 9, 1949.

Personnel classification, as the term is used here, is best defined
by contrasting it with personnel selection. In the pure case
of personnel selection we deal with a single job category, we have 
a limited number of job vacancies and a surplus of job applicants, 
and our problem is to pick the most promising of the applicants to 
fill the vacancies. In the pure classification program, by contrast, 
we are concerned simultaneously with a number of job categories, 
we have no more workers than there are jobs to be filled, and our 
problem is to decide which job shall be done by which individual. An 
example of a pure selection situation is that faced by a medical 
school which has 1000 applicants for admission and wishes to pick 
from this group the 100 who may be expected to succeed best in the 
curriculum of the school or in the professional duties of a doctor. 
The pure classification situation is most nearly approached in the 
military establishment, where a large flow of untrained youths con- 
tinually pours into the organization and must be channelled into 
dozens of different types of specialized training and work, and where 
everyone who meets minimum screening standards must be used in 
some capacity. The simplest paradigm of the selection situation 
is as follows: We have a job vacancy X and applicants A, B, and C.
Which individual should get the job? Reduced to the simplest form, 
the classification problem may be expressed: We have a vacancy in 
each of three jobs, X, Y, and Z and we have three applicants A, B, 
and C. Which applicant should be put to work in which job? 

In practice, of course, selection and classification occur not only 
in pure form, but also mixed. Thus if we have three jobs, X, Y, and 
Z, and add to the three applicants A, B, and C a fourth applicant 
D, our problem now involves elements of both classification and se- 
lection. We must not only sort out our applicants into the several 
job categories, but also reject the least promising. The mixed case 
will be fairly commonly found in practical personnel situations. The 
emphasis will vary from one situation to another, with now selec- 
tion and now classification dominating. Before we can hope to un- 
derstand the mixed case, however, we must try to understand the 
relatively simpler pure case. For that reason, most of what is said 
here will deal with the problems involved in a pure case of person- 
nel classification. 

In the past fifty years, the procedures and statistical rationale 
for selection have been fairly fully worked out. The cornerstone of 
that rationale is the multiple regression equation, through which a 
series of scores for the individual may be combined in a linear
expression which will yield a maximally accurate prediction of some
criterion of job success. Though plenty of problems remain with
respect to the detailed application of multiple regression techniques 
to the development of a battery of prediction tests, choice from 
among them, and combination of the tests into a battery, the main 
outlines for obtaining maximum prediction of a single job criterion 
are clear and are familiar to every student of tests. In the case of 
the classification problem, however, no such mathematically best so- 
lution has been formulated. (It is, of course, possible that no solu- 
tion exists.) 

Before we can hope for a solution to the problem of classifica- 
tion, we must see to it that the problem is precisely formulated and 
clearly stated. What, in its essence, is the problem? Reduced to its 
basic elements and to its pure form, it may be stated as follows: 


Given: A set of k job categories with N vacancies to be filled
($N \geq k$), and N individuals to be used in filling them.


Required: To assign the individuals to the jobs in such a way 
that the average success* of all the individuals in all the jobs to 
which they are assigned will be a maximum. 


TABLE 1
Aptitudes of Three Individuals for Three Jobs

Individual    Job A    Job B    Job C
I              55       60       65
II             50       50       55
III            45       50       45





We can illustrate this by the example of three men and three 
jobs presented in Table 1. Suppose that those represent perfectly 
valid measures of aptitudes of the three men for the three jobs, all 
scores being expressed in standard scores of some reference popu- 
lation. There are, of course, six permutations of assignment of the 
men to the jobs, and examination of these quickly shows that the 
sum of the aptitude scores is a maximum when I is assigned to Job 
C, II to Job A, and III to Job B. The sum is then 165. It is not nec- 
essary that the jobs be equally weighted. Thus, if it were especially 
important to have the best talent in Job B, for example, that job 
might be given triple weight. The maximum of A + 3B + C is ob- 
tained by assigning individual I to Job B, II to C, and III to A. 
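
The enumeration described above can be checked mechanically. The following is a minimal sketch, assuming only the scores of Table 1; the function name `best_assignment` and its `weights` argument are illustrative names introduced here, not part of the original address.

```python
from itertools import permutations

# Aptitude scores from Table 1: rows are Individuals I, II, III;
# columns are Jobs A, B, C.
APTITUDE = {
    "I":   {"A": 55, "B": 60, "C": 65},
    "II":  {"A": 50, "B": 50, "C": 55},
    "III": {"A": 45, "B": 50, "C": 45},
}

def best_assignment(aptitude, weights=None):
    """Try every permutation of jobs; return (best total, {person: job})."""
    people = list(aptitude)
    jobs = list(next(iter(aptitude.values())))
    weights = weights or {job: 1 for job in jobs}
    best_total, best_plan = None, None
    for order in permutations(jobs):
        # Weighted sum of aptitudes when person i takes job order[i].
        total = sum(weights[job] * aptitude[p][job]
                    for p, job in zip(people, order))
        if best_total is None or total > best_total:
            best_total, best_plan = total, dict(zip(people, order))
    return best_total, best_plan

# Equal job weights, then triple weight on Job B.
equal_total, equal_plan = best_assignment(APTITUDE)
weighted_total, weighted_plan = best_assignment(
    APTITUDE, {"A": 1, "B": 3, "C": 1})
print(equal_total, equal_plan)
print(weighted_total, weighted_plan)
```

With equal weights the search returns the assignment named in the text (I to C, II to A, III to B, sum 165); with Job B tripled it returns I to B, II to C, III to A.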

*The term “success”, as it is used in this paper, may be interpreted quite
broadly to include measures of job satisfaction as well as ratings of performance
or measures of production.










This miniature example shows many of the essential character- 
istics of the classification problem. The problem is one of simultane- 
ous assignment of all individuals, because the assigning of one man 
can only be made with reference to that of others. The separate 
decisions are not independent. Typically, some individuals must be 
assigned to jobs other than the ones for which they have the most 
aptitude. (In this illustration, Individual II had in one case to be 
assigned to Job A rather than Job C.) Differences in level of aptitude
within the individual emerge as a dominant factor in assignment.
Thus, in one case Individual III was assigned to Job A, although
his aptitude for that job was lower than that of either of the
other men, because he had no higher aptitude for any other job. 

There are, as has been indicated, a finite number of permutations
in the assignment of men to jobs. When the classification problem
as formulated above was presented to a mathematician, he
pointed to this fact and said that from the point of view of the mathe- 
matician there was no problem. Since the number of permutations 
was finite, one had only to try them all and choose the best. He dis- 
missed the problem at that point. This is rather cold comfort to the 
psychologist, however, when one considers that only ten men and 
ten jobs mean over three and a half million permutations. Trying 
out all the permutations may be a mathematical solution to the problem,
but it is not a practical solution.
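
The growth that makes exhaustive trial impractical is easy to tabulate. This is a small sketch, assuming one man per job so that the count is simply N factorial; `permutation_count` is a name introduced here for illustration.

```python
from math import factorial

def permutation_count(n):
    """Number of ways to assign n men to n distinct jobs, one man per job."""
    return factorial(n)

# Ten men and ten jobs already give more than three and a half million cases.
for n in (3, 5, 10, 15, 20):
    print(n, permutation_count(n))
```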

The classification enterprise involves two quite distinct groups 
of problems. One group centers around the development and choice 
of the tests which are to comprise the classification battery. The 
other group concerns procedures for using the test scores in classi- 
fication, given a particular battery of predictors. This matter of 
a finite number of permutations, referred to in the previous paragraph,
has to do with the second group of problems — those of using the 
information from a set of aptitude predictors. We will return to 
this part of the problem a little later. For the present, however, let 
us turn our attention to the problems involved in developing a bat- 
tery of tests for use in classification. 


Selection of Tests for a Classification Battery 


We may ask: What attributes should a test have if it is to be 
a useful member of a classification test battery? What should be the 
joint characteristics of a set of tests which are to form a battery to 
be used for classification? 

Perhaps the best way to approach these questions is to consider
the pseudo-classification problem of dividing men between two jobs.
This is termed a pseudo-classification problem because the limitation 
to two job categories permits us to direct our attention to differences 
in aptitude for the two jobs and thus to reduce the problem to one 
of selection. That is, we are able to select men on the basis of a sin- 
gle score representing difference in aptitude. When there are three 
or more jobs, we have a number of difference scores and are thrown 
back upon the true classification problem. However, the solution of 
the two-category problem serves to point up for us the qualities 
which we shall need to look for in tests and in test batteries for the 
more general classification problem with three or more jobs. 

For each of two job categories, A and B, we can make our best 
prediction, in the least squares sense, of further success on the job. 
These may be expressed as 


$$\tilde{Y}_A = \beta_{1A} z_1 + \beta_{2A} z_2 + \cdots + \beta_{kA} z_k$$
and
$$\tilde{Y}_B = \beta_{1B} z_1 + \beta_{2B} z_2 + \cdots + \beta_{kB} z_k , \tag{1}$$

where $\beta_{iA}$ is the weight to be applied to standard scores in variable
$i$ in predicting success in Job A and $\beta_{iB}$ is the weight for Job B. Now
what we are currently interested in is whether the individual is
likely to be more successful in Job A or in Job B. That is, we are
interested in a prediction of difference in success. This is, of course,
the simple difference* between our two predictions. Let us call this
$\Delta$, where

$$\Delta = \tilde{Y}_A - \tilde{Y}_B . \tag{2}$$

For purposes of assignment, individuals could be arranged in rank
order with respect to $\Delta$, and a dividing line set which would assign
the required number to each job.

Referring to Equations (1), and performing the subtraction
indicated in Equation (2), we get

$$\Delta = (\beta_{1A} - \beta_{1B}) z_1 + (\beta_{2A} - \beta_{2B}) z_2 + \cdots + (\beta_{kA} - \beta_{kB}) z_k . \tag{3}$$


We see, then, that the weight which a predictor receives in predicting
this difference score is the difference between its weights for the
two separate job categories. Under what circumstances will it receive
a large differential weighting?

*For the present, we are considering each of the two jobs to be equally
important, and are dealing with the simple difference. It would also be possible
to deal with a weighted difference, thereby attaching greater importance to one
of the jobs than to the other.
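
The two-job procedure of Equations (2) and (3), ranking on the predicted difference and drawing a dividing line at the quota, might be sketched as follows. The regression weights, the standard scores, and the names `predicted_difference` and `assign_two_jobs` are all hypothetical and serve only to illustrate the mechanics.

```python
def predicted_difference(z, beta_a, beta_b):
    """Delta of Equation (3): sum of (beta_iA - beta_iB) * z_i."""
    return sum((ba - bb) * zi for ba, bb, zi in zip(beta_a, beta_b, z))

def assign_two_jobs(scores, beta_a, beta_b, quota_a):
    """Rank people by Delta; the top `quota_a` go to Job A, the rest to Job B."""
    deltas = {name: predicted_difference(z, beta_a, beta_b)
              for name, z in scores.items()}
    ranked = sorted(deltas, key=deltas.get, reverse=True)
    plan = {name: ("A" if i < quota_a else "B")
            for i, name in enumerate(ranked)}
    return plan, deltas

# Hypothetical weights for two predictors and standard scores for four men
# (none of these values come from the paper).
BETA_A = [0.5, 0.1]
BETA_B = [0.1, 0.4]
SCORES = {"P": [1.0, -1.0], "Q": [0.0, 0.0], "R": [-1.0, 1.0], "S": [1.0, 1.0]}

plan, deltas = assign_two_jobs(SCORES, BETA_A, BETA_B, quota_a=2)
print(plan)
```

Note that man S, who stands high on both predictors, is placed by his difference score alone; his general level does not enter the two-job decision.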

Let us assume that we have only two predictors, 1 and 2. In this
case, we find that the differences in weight are given by the formulas

$$\beta_{1A} - \beta_{1B} = \frac{(r_{1A} - r_{1B}) - r_{12}(r_{2A} - r_{2B})}{1 - r_{12}^2}$$
and
$$\beta_{2A} - \beta_{2B} = \frac{(r_{2A} - r_{2B}) - r_{12}(r_{1A} - r_{1B})}{1 - r_{12}^2} . \tag{4}$$





An inspection of these formulas reveals that, ordinarily, a test 
will receive a substantial differential weight if it (a) has a substan- 
tial difference in validity for the two criteria in question and (b) 
does not have a high positive correlation with other tests which dif- 
ferentiate in the same direction or a high negative correlation with 
other tests which differentiate in the reverse direction. Occasionally 
a test may be found which receives a substantial differential weight 
in spite of having the same validity for both jobs, because it has a 
substantial correlation with another variable which does have a sharp 
validity difference between the two job categories. This is analogous 
to the suppression variable in the usual selection problem, in which 
a variable with near zero validity may receive a substantial negative 
weight if it has a high correlation with some variable or variables
with high validity.


TABLE 2
Relation of Differential Weights and Differential Validity
to Test Validities and Intercorrelation

                               Example
                     1      2      3      4      5      6
r_1A*               .30    .30    .30    .30    .30    .20
r_1B                .30    .10    .10    .10    .00    .20
r_2A                .40    .15    .15    .15    .00    .20
r_2B                .40    .45    .45    .45    .45    .60
r_12                .50    .80    .40    .00    .00    .60
r_AB                .00    .00    .00    .00    .00    .00
β_1(A−B)†           .00   1.22    .39    .20    .30    .38
β_2(A−B)            .00  −1.28   −.46   −.30   −.45   −.62
R_(Ã−B̃)(A−B)‡       .00    .54    .33    .26    .38    .45

*r_1A, r_1B, r_2A, r_2B are correlations of Tests 1 and 2 with job criteria A and B respectively.
†β_1(A−B) and β_2(A−B) are the regression weights for differential prediction.
‡R_(Ã−B̃)(A−B) is the correlation between actual aptitude difference and predicted aptitude difference.










Several examples of the weights which variables may receive 
as differential predictors are given in Table 2, and serve to illustrate 
the interaction of validities and intercorrelations. 

The first example is a pair of tests both of which have the same 
validities for both criteria. In this case, of course, neither test has 
any differential validity, neither receives any weight as a differen- 
tial predictor, and the validity of the differential prediction is ex- 
actly zero. The last row of the table is a row of validities for differ- 
ential prediction. We will turn to the formula for this value shortly. 

Examples 2, 3, and 4 all involve the same set of validities, but
different correlations between the two tests. Here Test 1, which is 
more valid for Criterion A, receives a positive weight; while Test 
2, which is less valid for A, gets a negative weight. With a high 
positive correlation between the two tests, the weights are high and 
the differential validity of the pair high. This corresponds to the 
(highly improbable) situation in a selection battery of finding a 
pair of tests, one with positive and one with negative validity, which 
have a high positive correlation. As the correlation between the tests 
drops, their differential weights drop; and their validity for differ- 
ential prediction drops also. 

The comparison of Examples 4 and 5 brings out the importance 
for differential prediction of getting tests which, while valid for one 
variable, have zero or negative validity for another. The drop in va- 
lidity of Test 1 for Criterion B from .10 to .00 and of Test 2 for 
Criterion A from .15 to .00 raises the validity of the pair of tests for 
differential prediction from .25 to .38 with the existing pattern of 
test and criterion correlations. 

Example 6 illustrates the suppression variable in the context of 
differential prediction. Test 1, which has equal validity for both cri- 
teria, receives a positive weight in differential prediction, because of 
its substantial correlation with Test 2, which is much less valid for 
Criterion A than for Criterion B. 
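
The differential weights discussed in these examples can be recomputed from Formulas (4). The sketch below assumes the two-predictor case only; the function name `differential_weights` is introduced here for illustration, and the input correlations are those read from Examples 2, 4, 5, and 6 of Table 2. The results agree with the tabled weights to within rounding.

```python
def differential_weights(r1a, r1b, r2a, r2b, r12):
    """Differential regression weights for two predictors, per Formulas (4)."""
    d1 = r1a - r1b            # validity difference of Test 1 (A minus B)
    d2 = r2a - r2b            # validity difference of Test 2 (A minus B)
    denom = 1.0 - r12 ** 2
    w1 = (d1 - r12 * d2) / denom
    w2 = (d2 - r12 * d1) / denom
    return w1, w2

# (r_1A, r_1B, r_2A, r_2B, r_12) for Examples 2, 4, 5 and 6 of Table 2.
EXAMPLES = {
    2: (0.30, 0.10, 0.15, 0.45, 0.80),
    4: (0.30, 0.10, 0.15, 0.45, 0.00),
    5: (0.30, 0.00, 0.00, 0.45, 0.00),
    6: (0.20, 0.20, 0.20, 0.60, 0.60),
}

for number, args in EXAMPLES.items():
    w1, w2 = differential_weights(*args)
    print(number, round(w1, 3), round(w2, 3))
```

Example 6 shows the suppression effect directly: Test 1 has a validity difference of zero, yet its weight is positive because of the term involving its correlation with Test 2.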

Table 2 probably directs undue attention to the correlation be- 
tween the test variables, because of the way the illustrations were 
selected. Usually it is likely to be the difference in validity which is 
the critical matter. The table suggests that in appraising a test for 
addition to a classification battery we should be as vitally concerned 
that it have vanishing validity for some job categories as that it have 
high validity for others. 

We might turn next to a consideration of the expression for the
validity of $\Delta$, our differential prediction. This can be obtained from
the familiar formula for the correlation of sums and differences. We
let

$\tilde{A}$ stand for composite score predicting success in Job A,
$A$ stand for the actual success in Job A,
$\tilde{B}$ stand for composite score predicting success in Job B, and
$B$ stand for actual success in Job B.

Then $(\tilde{A} - \tilde{B}) = \Delta$ is the predicted difference in success on the two
jobs, and $(A - B)$ is the actual difference in success. The validity
of the differential prediction is

$$R_{(\tilde{A}-\tilde{B})(A-B)} = \sqrt{\frac{R_A^2 + R_B^2 - 2 R_A R_B \, r_{\tilde{A}\tilde{B}}}{2(1 - r_{AB})}} , \tag*{(5)*}$$

where $R_A$ and $R_B$ are the validities of the two composites, $r_{\tilde{A}\tilde{B}}$ is
the correlation between the composites, and $r_{AB}$ is the correlation
between actual success in the two jobs.
This is the formula which was used for computing the values in the
last row of Table 2.

The relationships involved in Formula (5) are brought out more
clearly in the special case in which $R_A = R_B = R$. The formula then
becomes

$$R_{(\tilde{A}-\tilde{B})(A-B)} = \sqrt{\frac{R^2 (1 - r_{\tilde{A}\tilde{B}})}{1 - r_{AB}}} .$$


We can see that differential validity depends upon three con- 
siderations. Other things being equal, high validity of differential 
prediction will result when (1) the validity of prediction for the 
separate jobs is high, (2) the correlation between the weighted com- 
posite scores for predicting success in the two jobs is low or, better 
still, negative, and (3) the correlation of true success on the two jobs 
is high. 
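
The three considerations can be seen numerically. The sketch below takes Formula (5) to be the square root of $(R_A^2 + R_B^2 - 2 R_A R_B r_{\tilde{A}\tilde{B}}) / (2(1 - r_{AB}))$, which is an assumption in that the printed formula is damaged in this copy, though this form is consistent with the special case and with the considerations just listed. The function names and the illustrative values other than Example 5 are hypothetical.

```python
from math import sqrt

def differential_validity(r_a, r_b, r_pred, r_true):
    """Assumed form of Formula (5).

    r_a, r_b : validities of the composites for Jobs A and B
    r_pred   : correlation between the two predictor composites
    r_true   : correlation between actual success in the two jobs
    """
    numerator = r_a ** 2 + r_b ** 2 - 2.0 * r_a * r_b * r_pred
    return sqrt(numerator / (2.0 * (1.0 - r_true)))

def equal_validity_case(r, r_pred, r_true):
    """Special case R_A = R_B = R of the same formula."""
    return r * sqrt((1.0 - r_pred) / (1.0 - r_true))

# Example 5 of Table 2: two uncorrelated tests, each valid for one job only,
# so R_A = .30, R_B = .45, and the two composites are uncorrelated.
example_5 = differential_validity(0.30, 0.45, 0.0, 0.0)

# Hypothetical values, changing one consideration at a time.
base = equal_validity_case(0.50, 0.40, 0.20)
higher_validity = equal_validity_case(0.60, 0.40, 0.20)     # consideration (1)
lower_composite_r = equal_validity_case(0.50, 0.10, 0.20)   # consideration (2)
higher_criterion_r = equal_validity_case(0.50, 0.40, 0.35)  # consideration (3)
print(round(example_5, 2))
print(base, higher_validity, lower_composite_r, higher_criterion_r)
```

Under this assumed form, the value for Example 5 rounds to the .38 shown in the last row of Table 2, and each of the three changes raises the differential validity above the base case.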

Only the first and second of the above considerations relate to
the qualities of our test battery; the third represents a constant when
we are working with a particular pair of jobs. If valid differential
prediction is to be achieved, the two equally important considerations
are that separate validities be high and that the intercorrelations of
the separate score composites be low. Neglecting the case of the
suppression test, which is rarely met in practice, we can say that
the first condition will be satisfied insofar as for each of the jobs
there are one or more tests which have high validity, and insofar as
the correlations between the tests are low. The second condition will
be achieved insofar as the tests which have a high weight for one
job have low or zero weights for others, and insofar as the correlations
between tests are low. Our objective, then, is a battery of tests
in which each test has high validity for one or two jobs but has near
zero validity for the others, and in which the intercorrelations of the
separate tests are low.

*This formula, which replaces an erroneous formula included in the original
paper, was derived by William G. Mollenkopf. Its development is presented
in Research Bulletin 50-9, Educational Testing Service, Princeton, N. J.

Under what circumstances is a test likely to approach such a 
validity pattern? One would judge that maximum simplicity, purity, 
and univocality of the function measured is the governing condition. 
If we conceive of abilities as having primarily either positive or zero 
validity for a job, then the fewer abilities a test taps the greater is 
the number of job categories for which it may have zero validity 
and the greater is the value it can have for purposes of classification. 
The classification situation seems to be one in which the simple, fac- 
torially pure test comes into its own. In the creative and research 
activity underlying our test construction, then, efforts should be di- 
rected to measuring as many attributes of human behavior in as pure 
form as possible. 

Suppose now that we have a large pool of tests, from which we 
wish to select a limited number to constitute a classification battery. 
How shall we go about selecting the optimum pool of tests? Here 
again, for a simple selection program, the enterprise should be rela- 
tively straightforward. We should undertake simply to select those 
with the largest regression weights. In the case of a classification 
program, our problem is a good deal more complex. Our objective is 
to be able to differentiate aptitude for any one of the jobs or job 
families into which we are classifying individuals, from aptitude for 
any other one. In our thinking we can perhaps best approach this 
in terms of the hypothetical pure factors of factor analysis. If we 
had knowledge of the factor composition of job success for each of 
the jobs in which we are interested, we could specify what we would 
like to have in the tests of our battery. We would ideally like a high- 
ly reliable pure test of each of the factors which appears in certain 
job criteria but not in others. Of secondary desirability would be 
tests of factors appearing with very different loadings in the differ- 
ent job criteria. Such a battery would permit maximally valid dif- 
ferential prediction of success in the different jobs. 

Though it may be relatively easy to recognize what would con- 
stitute the ideal for a classification program, it is much more difficult 
to provide guidance for selection from among a pool of non-ideal 
existing tests. Where the tests possess varying degrees of factorial
purity, varying levels of validity, varying degrees of correspondence
in factorial pattern, any rigorous rules for selecting one particular 
set of tests will be very difficult to formulate, and no such rules are 
offered at the present time. The problem is further complicated by 
the fact that in practice we can rarely limit our activities to pure 
classification, because an element of selection is often included. It 
will often be necessary to compromise between tests which are out- 
standing in differential validity and tests which are high in general 
validity for a wide range of jobs. In application, then, the element 
of practical judgment will continue to bulk large in our choice of 
tests to constitute a battery. We can only hope that that judgment 
may be provided with a better set of guiding principles in the future 
than it has had in the past. 


Use of Test Scores to Accomplish Classification 


Now let us assume that the battery of tests which we are going 
to use in our classification program is fixed, and inquire how we 
shall use the battery of tests to accomplish the assignment of men. 

First of all, we will want to combine single test scores into 
weighted composite scores, one for each job. Except in the special 
case of two job categories which we considered earlier, in which it 
is possible to direct our efforts toward the prediction of a single 
difference in aptitude between the two jobs, I see no escape from 
the step of predicting success in each of the several jobs singly.* 
Thus, for each job category we will need just the same weighted 
composite of test scores which we would require if our problem were 
merely to select persons for that job on the basis of the battery of 
tests which we have assembled. 

Our next need is to express these score composites for the dif- 
ferent jobs in units which will make it most convenient to compare 
the individual’s probable success in the different jobs. Clearly we 
need some single type of standard score scale for all of the job spe- 
cialties. 

*Dr. P. J. Rulon has recently reported informally on the development of
procedures for computing a multiple discriminant function which may eliminate
the need for predicting success for separate job categories. The full report of
this method will be awaited with interest.

One possibility is to take the composite aptitude scores for each
job, just as they stand, and transform them into standard scores
with the same mean and standard deviation. Quite possibly we may
wish to normalize the distribution at the same time. In the interests
of concreteness, let us suppose that we are going to use a mean of
50 and a standard deviation of 10. Then for the standard reference
population, the distribution of prediction scores for each job category
will have a mean of 50 and a standard deviation of 10. This general
type of score is very familiar, and has been widely used in testing
programs of all sorts. When each test score or score composite is
to be used in connection with a number of jobs in the same job family,
it may represent as serviceable a type of score as it is possible
to prepare. When, however, each of several score composites is to
be used to select men for one specific job, the usual standard score
units seem to have two limitations. First, they do not take any
account of the differences in validity of the different score composites.
Second, they do not take any account of differences in importance
of the several jobs. Let us examine these two points.

Suppose, as an extreme example and for purposes of clarification, 
we have score composites for Jobs A and B which have validities 
(against equally relevant and reliable criteria of success) of .20 and 
.80 respectively. Suppose we have a job applicant who is one stand- 
ard deviation above the mean on each of these score composites. What 
does this mean with respect to his probable success in the two jobs? 
In the first job, where the predictor has a validity of .20, our best 
estimate is that the applicant will be two-tenths of a standard devia- 
tion above the mean in job success. In the second job, by contrast, 
we may expect he will be eight-tenths of a standard deviation above 
the mean. The standard score of 60 will have very different signifi- 
cance in these two cases. It will stand for sharply different levels of 
expected success. 

We may think of true or actual achievement as consisting of two 
components, one which is predicted by our score composite and 
one which is not. Similarly, variance in true achievement consists 
of two parts, predictable variance and non-predictable variance. The 
predictable variance will be $R^2$ times the total variance, where $R$ is
the validity of the test composite. It is proposed that more meaningful
comparability of score units for different job categories is
achieved when the variance is made equal for scores representing
actual achievement in the various jobs. The variance of the composite
predictor scores will then be proportional to $R^2$, and the standard
deviations proportional to $R$. In the example, where Composite
A had a validity of .20 and Composite B one of .80, the two score 
distributions should not have equal standard deviations, but their 
standard deviations should stand in the ratio of .20 to .80. Thus, if 
the standard deviation of true achievement were set equal to 10 for 
each job, the standard deviation of composite scores would be 2 for
Job A and 8 for Job B. Thus, our applicant who fell one standard
deviation above the mean in both aptitude composites would receive 
a converted score of 52 for Job A and one of 58 in Job B. These 
scores would correspond to our best prediction that he would fall 
two-tenths of a standard deviation above the mean in success on Job 
A and eight-tenths of a standard deviation above on Job B. 
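
The conversion just described can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original procedure: the function name `validity_scaled_score` and its parameters are hypothetical, and it assumes the applicant's standing on each composite is already expressed in standard-deviation units.

```python
def validity_scaled_score(z, validity, mean=50.0, criterion_sd=10.0):
    """Convert a composite z-score to a score whose S.D. is proportional to validity.

    z            -- standing on the composite, in standard-deviation units
    validity     -- correlation R between the composite and job success
    criterion_sd -- S.D. assigned to true achievement (10 in the text's example)
    """
    return mean + criterion_sd * validity * z


# The applicant who is one S.D. above the mean on both composites:
score_a = validity_scaled_score(1.0, 0.20)  # Job A, validity .20
score_b = validity_scaled_score(1.0, 0.80)  # Job B, validity .80
print(round(score_a), round(score_b))
```

With these inputs the two converted scores differ only because the validities differ, which is the point of the adjustment.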

This type of score conversion seems to facilitate direct com- 
parison of probabilities of success in different jobs. The same nu- 
merical value now signifies the same probable status in the group in 
each job. The highest numerical value for an individual corresponds 
to the job in which we should predict highest success for him relative 
to his fellows. Where differences in battery validity for different 
jobs are substantial, tempering our expectation of individual job per- 
formance by what we know of the validity of the predictor for each 
job should represent a significant improvement in interpretation of 
individual aptitudes. 

A second adjustment of score composites which should perhaps 
enter in before the composites are used as a basis for assignment is 
weighting them in accordance with the importance of the job. Thus, 
let us assume that we have a flow of incoming recruits who must be 
assigned either to electronic technicians school, to aviation mechanics 
school, or to cooks and bakers school. Let us assume that current 
evaluation of the needs of the Service weights electronic technicians 
5, aviation mechanics 2, and cooks and bakers 1. A score which is to 
be used for classification might well incorporate these weights as 
multiplying factors, to insure that even slight superiority in a
high-priority job resulted in assignment to that job. In some cases, of
course, there may not be enough difference in the importance of
different jobs, or not enough may be known about the importance of
different jobs, to make this type of weighting fruitful.


TABLE 3
Illustration of Standard Scores Weighted for Validity
and Job Importance

                                          Electronic   Aviation   Cook-
                                          Technician   Mechanic   Baker
Importance factor                              5           2        1
Validity of test composite                    .60         .70      .40
S. D., equally weighted true criterion
  scores                                       10          10       10
S. D., adjusted for validity
  (predicted criterion scores)                  6           7        4
S. D., adjusted for validity and
  importance                                   30          14        4
Mean, all score distributions                  50          50       50


Let us carry our illustration further, and combine these two
types of weighting factors. Suppose that for the three jobs which
we have just considered, the validity coefficients against comparably
good criterion measures are respectively .60, .70, and .40. Then, the
standard deviation of converted scores for Job A (electronic
technician) might be 10 × 5 × .60 = 30. For aviation mechanic it would
be 10 × 2 × .70 = 14; and for cooks and bakers, 10 × 1 × .40 = 4. The
picture is summarized in Table 3. The final converted scores, all with
the same mean, but with standard deviations adjusted both for validity
and for job importance, represent a type of score which appears to
permit the simplest and most direct comparison, with a view to
personnel assignment.* We must now consider how these scores, or scores
derived by some other system, might be used in the operation of
assignment.
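
The combined weighting can be written out as a short Python sketch. The importance weights and validities are those of the illustration; the names `JOBS`, `converted_sd`, and `converted_score`, and the recruit's three z-scores, are hypothetical and serve only to show how the converted scores would be compared.

```python
# job: (importance weight, validity of composite), values from the illustration
JOBS = {
    "electronic technician": (5, 0.60),
    "aviation mechanic": (2, 0.70),
    "cook-baker": (1, 0.40),
}


def converted_sd(importance, validity, criterion_sd=10.0):
    """S.D. of converted scores: criterion S.D. x importance x validity."""
    return criterion_sd * importance * validity


def converted_score(z, importance, validity, mean=50.0):
    """Converted score for a composite z-score, all jobs sharing one mean."""
    return mean + converted_sd(importance, validity) * z


sds = {job: converted_sd(w, r) for job, (w, r) in JOBS.items()}

# A hypothetical recruit whose raw standing is best on the cook-baker composite.
z = {"electronic technician": 0.5, "aviation mechanic": 1.0, "cook-baker": 1.5}
scores = {job: converted_score(z[job], *JOBS[job]) for job in JOBS}
best = max(scores, key=scores.get)
print(sds, best)
```

In this made-up case the recruit's modest superiority on the high-priority composite outweighs a larger superiority on the low-priority one, as the weighting is intended to arrange.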

Three general ways may be proposed in which a set of compos- 
ite scores could be used for the assignment of personnel in a classi- 
fication situation. For purposes of identification these may be called 
the Method of Divine Intuition, the Method of Daily Quotas, and the 
Method of Predicted Yield. They will be discussed in that order. 

The essence of the Method of Divine Intuition is that it is not a 
method at all — or not describable in any exact terms. The individ- 
ual responsible for making recommendations or assignments looks at 
the set of aptitude scores for each individual and, reconciling these 
scores with what he knows about flow and quotas, comes by unspeci- 
fied and unspecifiable processes to a decision as to the category into 
which each man is to be put. The approach is, of course, quite
unstandardizable. At its worst, it degenerates into the not unknown
routine of assigning the A’s to one job, the B’s to the next, and so
on through the alphabet. At its best, it becomes the sincere but
relatively unguided attempt of the individual personnel worker to
reconcile a host of complex aptitude profiles with a varied array of job
assignments.

*Hubert Brogden (An approach to the problem of differential prediction.
Psychometrika, 1946, 11, 139-154) has approached the problem of a uniform
score scale for predictions of success in different jobs from a somewhat
different angle. He proposes translating everything into units of dollars
saved by assigning Individual I, rather than an average individual, to
Job A. However, he indicates that it will usually not be feasible to get a
direct estimate of this dollar saving, and that one will have to rely upon
subjective estimates of the type discussed in the present article. Brogden
makes one point which is a desirable supplement to our discussion of a
uniform scale for predictor scores. This is that one must take account of
the extent to which efficiency varies from person to person. In some jobs
the difference in level of performance may be relatively slight between
the best individual and the average individual, while in other jobs it may
be very great. The weighting factor for any job should take account not
only of the importance of the job but also of the extent of individual
differences in performance of it. The two judgments might be made
separately, or they might be synthesized into a single compound judgment
of the importance, to the success of the total organization, of the
differences which are in fact found between individuals in the performance
of that particular job.

It is as much in this unstandardized situation as anywhere that 
the adjusted type of standard score which has been described earlier 
would be helpful. It would be helpful because it would take out of 
the hands of the individual operating personnel worker the necessity 
for making judgments about the relative validity of the different 
score composites and the relative importance of the different jobs. 
These judgments would be centrally made once and for all by the 
best talent which the organization was able to bring to bear upon 
the problem. The first-line operating personnel worker could take the 
numbers at their face value, and strive simply to get every individ- 
ual into the job for which he had the highest numerical score. 

Difficulties arise when, due to quotas, it is not possible to get all 
(or even most) of the individuals into the job for which their score 
is highest. This will be the case when there is a disproportionate 
need for personnel in certain particular jobs or in certain families 
of closely related jobs. It is for these situations, and to minimize the 
subjectivity inherent in assignment by unrestrained individual judg- 
ment, that the methods of Daily Quotas and Predicted Yield are pro- 
posed. 

The Method of Daily Quotas undertakes to take the quota re- 
quirements for each day (or other specified period), and make the 
optimum adjustment of aptitudes to assignments within the limita- 
tions of the personnel and quotas available for that period. In ap- 
proaching this problem of optimum adjustment of individuals to 
quotas, we encounter the basic dilemma of classification. Due to the 
interdependence of the assignment of different individuals, the opti- 
mum allocation of individuals requires that everybody be assigned 
simultaneously. That is, in order to decide to which job any one per- 
son is to be assigned, one needs to know what other persons are can- 
didates for these same jobs and what their qualifications for each job 
are. But there is no practical way in which simultaneous assignment 
of any substantial number of individuals can be handled. Since si- 
multaneous assignment is impractical, what is required is some way 
of establishing the best order in which the individuals shall be as- 
signed to jobs. That is, since we must start with some individual and 
assign him first, with whom shall we start, who shall be assigned 
next, and in what order shall we proceed from there? 










The key to this problem lies in the fact, which we have previously 
noted, that it is differences in aptitude which are important for clas- 
sification. The person whose assignment has great influence upon the 
success of our classification enterprise is the person who shows a 
wide range in potential contribution in different jobs. The person 
who can make an equal contribution in all jobs can be assigned to 
any one without gain or loss to the total classification enterprise. We 
need, then, some index of variability in expected contribution, in 
terms of which we can arrange the members of our group in order, 
with a view to assigning first those with the greatest variability. 

One may question whether it will be possible to find any one in- 
dex of variability which can be shown analytically to be the best for 
use in classification. The procedure which is proposed below is ad- 
mittedly a rule-of-thumb one, but is perhaps a reasonable way of or- 
ganizing aptitude estimates for use in personnel classification. It is 
assumed that we start with aptitude estimates for each man for each 
job category, expressed in some type of comparable score units. The 
units which were proposed earlier take account of the validity of the 
score composites and the importance weighting of the jobs, but the 
procedure would be essentially the same whether this is done or not. 
The steps would be essentially as follows: 

(1) Determine for each individual a measure of spread of his 
predicted achievement scores. The problem is to determine the par- 
ticular measure of spread which will most effectively identify those 
with certain outstanding aptitudes. The measure should be simple, 
and it should give special weight to spread among the individual’s 
higher aptitude scores. It would be an exceptional situation in which 
it would be necessary to assign any individual to a job which fell in 
the lower half of his aptitude ratings, and so we do not need to con- 
cern ourselves about how inept the individual is for the jobs for 
which he is particularly inept. Often it will be a question of choos- 
ing from among the two or three highest ratings. As a practical ex- 
pedient, it is suggested that for each individual the difference be de- 
termined between his highest predicted achievement score and his 
median predicted achievement score. An alternative might be to de- 
termine the difference between his highest and next highest scores. 

(2) Arrange all the individuals in the order of their measure 
of spread, from high spread to low spread. 

(3) Assign to a job first the individual with the largest spread 
score, then the one with the next largest, and so on. Insofar as quotas 
permit, assign each individual to the job for which he has the high- 
est predicted achievement. If assignments are made in this order,
maximum flexibility of assignment will be available for those indi-
viduals who show the greatest spread in predicted achievement. The 
last individuals to be assigned, for whom a number of jobs may be 
excluded because the quotas are already filled, will be those who show 
relatively little difference with respect to their predicted achieve- 
ment in the several jobs. 
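
Steps (1) to (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the spread measure is the suggested highest-minus-median difference, the total of the quotas is assumed to be at least the number of persons, and the names `spread`, `assign_by_spread`, and the three sample persons are hypothetical.

```python
from statistics import median


def spread(scores):
    """Step 1: highest predicted score minus median predicted score."""
    return max(scores) - median(scores)


def assign_by_spread(people, quotas):
    """people: dict name -> list of predicted scores, one per job.
    quotas: list of places open in each job, in the same job order.
    Returns dict name -> index of the job assigned.
    Assumes sum(quotas) >= len(people)."""
    remaining = list(quotas)
    # Step 2: order the individuals from high spread to low spread.
    order = sorted(people, key=lambda name: spread(people[name]), reverse=True)
    assignment = {}
    for name in order:
        scores = people[name]
        # Step 3: best job for this person among those with quota left.
        open_jobs = [j for j in range(len(scores)) if remaining[j] > 0]
        best = max(open_jobs, key=lambda j: scores[j])
        assignment[name] = best
        remaining[best] -= 1
    return assignment


people = {
    "Adams": [55, 54, 53],  # nearly equal in all jobs: spread 1
    "Baker": [60, 50, 48],  # outstanding in job 0: spread 10
    "Clark": [52, 51, 58],  # outstanding in job 2: spread 6
}
assignment = assign_by_spread(people, quotas=[1, 1, 1])
total = sum(people[name][job] for name, job in assignment.items())
print(assignment, total)
```

In this made-up example, assignment in list order would give job 0 to Adams and push Baker into job 1; assigning in order of spread reserves job 0 for Baker and costs Adams only one point.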

To explore the effectiveness of this routine, a synthetic set of
scores was set up by drawing cards from a deck. The numbers in
each denomination were arranged so as to yield an approximately
normal distribution of scores. Scores on ten independent “tests” were
obtained by drawing cards from the pack. Then composite scores for
predicting achievement in seven “jobs” were obtained by combining
different groupings of “test” scores. These seven score composites
were obtained for a sample of 50 individuals. The composite scores
were designed to have population means of 50 and population standard
deviations of from 4 to 5.

First, the limits of zero effectiveness of classification and 100% 
effectiveness of classification were defined by determining the mean 
aptitude score when men were assigned to jobs at random, on the one 
hand, and when each man was assigned to the job for which he had 
the highest aptitude (without regard to quotas) on the other hand. 
For this sample of 50 cases, the mean scores were 48.34 and 52.16, 
respectively. The difference of 3.82, almost one standard deviation 
of the original aptitude score distributions, represents the difference 
between random assignment and ideal assignment. 

Then it was assumed that the quota was 7 for each job except 
the last, for which the quota was 8. First assignment of the men was 
made in “alphabetical” order, that is their order in the list of scores. 
As far as quotas permitted, each man was assigned to the job for 
which his “predicted achievement” was highest. Following this pro- 
cedure, the mean of all the aptitude scores was 51.82, constituting 
91% of the possible improvement over random assignment. Then 
the 50 cases were arranged in a new order, on the basis of the meas- 
ure of spread of aptitude for each individual. Assignment was made 
following this order, again assigning each individual, within quota 
limits, to the job for which he had the highest aptitude. The mean 
of all the aptitude scores of men for the jobs to which they were as- 
signed was now 51.92, representing 94% of the possible improve- 
ment over random assignment. In this particular illustration, with 
substantially equal quotas for each job, order of assignment made 
little difference; and most of the advantages of using aptitude tests
could be obtained regardless of the order in which the men were as-
signed. 

By contrast, it was next assumed that the quotas for the several
jobs were respectively 20, 10, 5, 5, 5, 3, and 2. Under these
circumstances, assignment in “alphabetical” order achieved a mean
aptitude score of 50.86, or 66% of the possible improvement over
random assignment, while assignment in order of spread of aptitudes
achieved a mean aptitude score of 51.58, or 85% of the possible
improvement over random assignment. These results support the
reasonable conclusion that it is when quotas are out of balance with
supply that taking account of the factor of spread makes possible
improved classification. In this case, arranging the individuals in order
of spread of aptitude scores and assigning them in that order
resulted in a substantial improvement in the effectiveness of
assignment.
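
The percentages quoted for the card-drawing experiment follow from the two limits of 48.34 (random assignment) and 52.16 (ideal assignment). A small Python check is given below; the function name `percent_of_possible_gain` and the labels are hypothetical, while the means are those reported in the text.

```python
def percent_of_possible_gain(mean_score, random_mean=48.34, ideal_mean=52.16):
    """Share of the random-to-ideal improvement achieved, as a percentage."""
    return 100.0 * (mean_score - random_mean) / (ideal_mean - random_mean)


reported_means = {
    "equal quotas, list order": 51.82,
    "equal quotas, spread order": 51.92,
    "unequal quotas, list order": 50.86,
    "unequal quotas, spread order": 51.58,
}
gains = {label: round(percent_of_possible_gain(m))
         for label, m in reported_means.items()}
print(gains)
```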

The great limitation of the Method of Daily Quotas as a proce- 
dure for assignment is that it is restricted by the requirements of 
fluctuating quotas. Thus, Monday’s quota may call for 100 aviation 
mechanics to fill the roster of a class entering training, and then 
there may be no further call for aviation mechanics for the rest of 
the week. If assignment operates on a short-time basis and such 
wide swings are typical, classification is reduced to a fraction of its 
possible effectiveness. It is only when quotas are reasonably stable 
from day to day, or when personnel can be pooled for a long enough 
time to permit them to become stable, that the Method of Daily Quo- 
tas can be an effective classification routine. 

One way of stabilizing assignment procedures in the face of fluc- 
tuating daily quotas is to adopt some variation of the Method of Pre- 
dicted Yield. This means, in general, forecasting the numbers re- 
quired in each job category for a period of time, and then working 
out a standard routine for assignment which will, on the average, 
yield the required proportions in each job. The procedure should be 
that which will, for the average of all persons assigned, achieve the 
maximum predicted achievement of each man in the job to which he 
is assigned. (The type of weighted standard score described in a 
previous section might well be used for expressing predicted achieve- 
ment in different job categories.) 

By way of illustration, let us suppose that we wish to divide job 
applicants among maintenance, clerical, and sales positions. Let us 
suppose, for simplicity, that the relative importance and validity fac- 
tors are such that we may use standard scores with the same stand- 
ard deviation for all three job categories. Let us further assume that
estimates of the flow needed into the three job areas are in the pro-
portions 50, 30, 20. We have for each individual three standard 
scores which predict his probable achievement relative to his fellows 
in each of the three areas. Our problem is to prepare some set of 
rules which will result in the assignment of the required proportions 
to the three jobs, and will at the same time maximize predicted 
achievement. How shall we set the dividing lines which will separate 
those who should be assigned to Job A from those who should be 
assigned to Job B or Job C? 

If we are willing to assume that the various regressions of suc- 
cess on aptitude composite are linear, to specify that we will use 
straight lines (or planes or hyperplanes) to separate our sub-groups 
for assignment, and to make the further assumption that the various 
aptitudes with which we are concerned are normally distributed, the 
problem appears to be susceptible to mathematical solution. A mathe- 
matical solution for the two-variable case has been proposed by Dr. 
T. F. Cope of Queens College.* The solution in this case supports the 
earlier discussion, in which it was pointed out that the two-variable 
problem can be treated as a single set of difference scores. The
mathematical solution is of the form


$$Z_A - Z_B = \text{constant},$$


where $Z_A$ and $Z_B$ are standard scores, of the type we have discussed,
for predicting success in Jobs A and B.

Thus, to decide whether a person should be assigned to Job A or 
Job B, one merely subtracts the B score from the A score, each score 
being weighted if that seems desirable. If the difference is greater 
than a specified constant, the individual is assigned to Job A; if 
equal to or less than that constant, to Job B. The size of the con- 
stant is determined by the numbers required in the two jobs and by 
the correlation between the scores for predicted aptitude in the two 
jobs. The value of the constant k is given by an equation of the form


$$\int_{-\infty}^{\infty} dx \int_{-\infty}^{x+k} z \, dy = p.$$


The numerical value of k can be approximated in any given case by 
preparing a distribution of difference scores, and counting off the 
required number of cases. 
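
The counting-off approximation might be sketched as follows. The function name `cutoff_from_differences` and the five pairs of scores are hypothetical; the sketch places k halfway between the last difference counted into Job A and the first counted into Job B, which is one of several reasonable conventions and assumes no tie at that boundary.

```python
def cutoff_from_differences(a_scores, b_scores, n_for_a):
    """Approximate k so that the n_for_a largest A-minus-B differences exceed it."""
    diffs = sorted((a - b for a, b in zip(a_scores, b_scores)), reverse=True)
    if not 0 < n_for_a < len(diffs):
        raise ValueError("n_for_a must leave at least one case for each job")
    # Midpoint between the last case counted off for A and the first left for B.
    return (diffs[n_for_a - 1] + diffs[n_for_a]) / 2.0


a_scores = [60, 55, 50, 45, 52]
b_scores = [50, 56, 49, 50, 44]
k = cutoff_from_differences(a_scores, b_scores, n_for_a=3)

# Apply the rule of the text: difference greater than k goes to Job A.
to_a = sum(1 for a, b in zip(a_scores, b_scores) if a - b > k)
print(k, to_a)
```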

If some cases are rejected for both jobs, i.e., if there is a
minimum score to qualify for each job, the value of k will be determined
by the two minimum scores to qualify. These qualifying scores, in
turn, will be determined by the number of individuals required for
each job. Brogden has shown that the constant k will, in this case,
be equal to the difference in the two cut-off scores.


*Personal communication to the author.

The solution of the form $Z_A - Z_B = k$ will apparently generalize
to three or more dimensions. In this case, one will have $n(n-1)/2$
hyperplanes which will divide the multivariate frequency distribution
into a number of sub-regions. It is anticipated that these hyperplanes
will all be of the form


$$Z_A - Z_B = k_1,$$

$$Z_A - Z_C = k_2,$$

$$Z_B - Z_C = k_3,$$
etc.


In this case, one would have to deal merely with a set of simple dif- 
ferences between predictor scores. To assign an individual, one could 
proceed by taking the individual’s A score and subtracting from it 
his B score. The difference would tell whether he would be better 
assigned to A or B. Taking the winner from this comparison, one 
would subtract from it the C score to determine whether C was a bet- 
ter assignment than the previous choice. There would be, for each in- 
dividual, n — 1 subtractions and decisions, where n is the number of 
job categories among which placement is being made. Assignment 
would be made to the job category which survived as the winner in 
all these comparisons. The routine work for this task could be ar- 
ranged so that it would proceed quite rapidly. 
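
The routine of successive subtractions can be sketched in Python. The function name `assign_by_differences` and the dictionary layout are hypothetical, and the three constants used here are arbitrary illustrative values. The rule applied at each step is the one stated for two jobs: the earlier job is kept only if its score exceeds the later job's score by more than the constant for that pair.

```python
def assign_by_differences(scores, k):
    """scores: dict job -> predictor score, in the order jobs are compared.
    k: dict (earlier_job, later_job) -> constant for that pair.
    Makes n - 1 subtractions and decisions for n jobs."""
    jobs = list(scores)
    winner = jobs[0]
    for challenger in jobs[1:]:
        difference = scores[winner] - scores[challenger]
        if difference <= k[(winner, challenger)]:
            winner = challenger  # the challenger displaces the current winner
    return winner


# Illustrative constants only.
k = {("A", "B"): -3, ("A", "C"): -5, ("B", "C"): -2}

print(assign_by_differences({"A": 50, "B": 52, "C": 50}, k))
print(assign_by_differences({"A": 50, "B": 54, "C": 50}, k))
print(assign_by_differences({"A": 50, "B": 50, "C": 56}, k))
```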

The basic problem in this case is to define the constants $k_1$, $k_2$,
$k_3$, etc. in such a way as to yield the required proportion of cases in
each of the job categories. This problem is also believed susceptible
to an analytical mathematical solution, though the solution is not
available at this time. Even without a mathematical solution, however,
it does not appear too laborious to determine approximate values
of the constants, based upon a set of empirical data, by an iterative
procedure. The procedure can be illustrated by data for three
variables and 162 individuals, which were analyzed by the writer.
The three variables may be thought of as predictor scores. (As a
matter of fact, they were final course grades and certain component
grades from a measurement course.) Identifying them as variables
A, B, and C, the correlations were as follows: $r_{AB} = .38$;
$r_{AC} = .66$; $r_{BC} = .90$. For each of the variables, scores were
converted to standard scores with means of 50 and standard deviations
of 10.










The first problem was to determine values of $k_1$, $k_2$, and $k_3$ to
give a yield in the proportion 50 A’s, 30 B’s and 20 C’s. The initial
estimates of the k’s were made by taking the variables by pairs and
using the procedure described earlier. That is, variables A and B
were considered, with the required proportion being 50 to 30, and $k_1$
was estimated to be


$$-0.32\,(10)\sqrt{1 - (.38)^2} = -3.0.$$


Similarly, $k_2$ was estimated to be $-4.1$ and $k_3$ to be $-1.1$. For the
first empirical trial, therefore, the values $-3$, $-5$, and $-2$ were used
as the k’s. The rules for assignment became: “Assign persons to
Job A unless B score is 3 points higher than A score or C score is
5 points higher than A score; assign to Job B unless C score is 2
points higher than B score.” This procedure led to the following
assignments:


To Job A: 90 cases, or 56% 
To Job B: 57 cases, or 35% 
To Job C: 15 cases, or 9%. 
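
The initial estimates of the constants can be approximated numerically. The sketch below rests on an assumed reading of the printed expression for the first constant: the normal deviate that cuts off the first job's share of the pair, multiplied by the score S.D. of 10 and by $\sqrt{1 - r^2}$. The function name `initial_k` is hypothetical. Under this reading the values for the A-B and B-C pairs come out within about a tenth of a point of those in the text; the A-C value agrees less closely, so the reading should be taken as approximate.

```python
from math import sqrt
from statistics import NormalDist


def initial_k(n_first, n_second, r, sd=10.0):
    """First-trial constant for a pair of jobs (assumed reading of the text).

    n_first, n_second -- required numbers in the two jobs of the pair
    r                 -- correlation between the two predictor scores
    """
    share_first = n_first / (n_first + n_second)
    # Normal deviate that has share_first of the cases above it.
    z = NormalDist().inv_cdf(1.0 - share_first)
    return z * sd * sqrt(1.0 - r * r)


k1 = initial_k(50, 30, 0.38)  # Jobs A and B
k3 = initial_k(30, 20, 0.90)  # Jobs B and C
print(round(k1, 2), round(k3, 2))
```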


Adjustments were then made to yield more cases in Job C, at the 
expense of both Jobs A and B. That is, the constants were changed so 
that the C scores would not have to be as much above the A and B 
scores. On the fourth trial, at the end of about half an hour, the fol- 
lowing rules for assignment were arrived at: “Assign to Job A un- 
less B or C score is 3 points higher than A score; assign to Job B 
unless C score equals or exceeds B score.” This procedure led to as- 
signments as follows: 


To Job A: 81 cases, or 50.0% 
To Job B: 48 cases, or 29.6% 
To Job C: 33 cases, or 20.4% . 


This is certainly as close correspondence to the desired percentages 
as one could hope to obtain. 
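
The cut-and-fit adjustment was done by hand, but its general character can be imitated mechanically. The sketch below is not the writer's procedure: it substitutes one additive offset per job for the pairwise constants (the constant for a pair is then the later job's offset minus the earlier job's), uses randomly generated, uncorrelated scores in place of the course grades, and nudges each offset in proportion to the shortfall in that job's yield. The names `yields` and `fit_offsets`, the step size, and the number of trials are all assumptions.

```python
import random


def yields(rows, offsets):
    """Proportion of cases assigned to each job when every case goes to the
    job with the highest (score + offset)."""
    counts = [0] * len(offsets)
    for row in rows:
        best = max(range(len(offsets)), key=lambda j: row[j] + offsets[j])
        counts[best] += 1
    return [c / len(rows) for c in counts]


def fit_offsets(rows, targets, step=10.0, trials=300):
    """Raise the offset of any job yielding too few cases, lower it for any
    job yielding too many, and repeat."""
    offsets = [0.0] * len(targets)
    for _ in range(trials):
        props = yields(rows, offsets)
        offsets = [o + step * (t - p) for o, t, p in zip(offsets, targets, props)]
    return offsets


rng = random.Random(0)  # fixed seed so the synthetic data are reproducible
rows = [tuple(rng.gauss(50, 10) for _ in range(3)) for _ in range(162)]
targets = [0.50, 0.30, 0.20]

offsets = fit_offsets(rows, targets)
final = yields(rows, offsets)
print([round(o, 1) for o in offsets], [round(p, 3) for p in final])
```

Because the cases are discrete, the yields can only approach the targets to within a case or two, just as in the hand-worked trials.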

Using the same set of scores, another problem was worked out 
in which the required percentages were 10 for A, 20 for B, and 70 
for C. With three trials and about the same amount of time as be- 
fore, a set of rules was reached which gave the following results: 


To Job A: 17 cases, or 10.5%
To Job B: 32 cases, or 19.6%
To Job C: 113 cases, or 69.8%.


In this case, the rules were: “Assign to A if A score is 10 points
above B and C scores; assign to B if B score is 4 points above C
score.” 

With a greater number of variables, or with a very large N, this
cut-and-fit procedure would become more laborious; but even so, the
time involved would probably be a small item in the total planning
of a classification program. Again, Brogden* has considered a similar
problem and worked out a similar iterative procedure for the
situation in which a portion of the job applicants may be rejected
and not assigned to any of the jobs.

The Method of Predicted Yield would be feasible only in a large- 
scale program which was going to continue on a basis of stable re- 
quirements for a considerable period of time. It also requires pro- 
vision for absorbing temporary unbalance between quotas and yield. 
Within those limitations, it seems to offer maximal opportunity for 
effective classification. 

By the very nature of things, classification of personnel can 
never have the neatness and elegance that are possible in an enter- 
prise of pure selection. Classification will relatively rarely be called 
for in its pure form. Even when it is, the variety of points at which 
professional judgment must enter will increase as some function of 
the number of jobs among which the available personnel must be 
divided. This paper has tried to suggest, however, some directions 
in which we may move to make more effective our design and selec- 
tion of a test battery and our use of the resulting test scores for dif- 
ferential assignment of personnel. 


*Brogden, Hubert E. An approach to the problem of differential prediction. 
Psychometrika, 1946, 11, 139-154. 


Manuscript received 10/10/49 
Revised manuscript received 1/3/50 

















PSYCHOMETRIKA—VOL. 15, NO. 3 
SEPTEMBER, 1950 


CHANGES IN COMMON-FACTOR LOADINGS AS TESTS ARE 
ALTERED HOMOGENEOUSLY IN LENGTH* 


J. P. GUILFORD 
UNIVERSITY OF SOUTHERN CALIFORNIA 
AND 
WILLIAM B. MICHAEL
PRINCETON UNIVERSITY 


Formulas are derived by which, given the factor loadings and
the internal reliability of a test of unit length, the following
estimates can be made: (1) the common-factor loadings for a similar
(homogeneous) test of length n; (2) the number of times (n) that
a test needs to be lengthened homogeneously to achieve a factor
loading of a desired magnitude; and (3) the correlation between
two tests, either or both of which have been altered in length, as a
function of (a) the new factor loadings in the altered tests or (b)
the original loadings in the unit-length tests. The appropriate use
of the derived formulas depends upon the fulfillment of four
assumptions enumerated.


The problem of ascertaining changes in common-factor load- 
ings as tests are altered homogeneously in length arises, particularly, 
when (1) short tests have been factor-analyzed and tests of greater 
length are wanted for more accurate measurement, (2) long tests 
have been factor-analyzed but, to keep a test battery within practi- 
cal limits, short tests are used to measure the contributions of factors 
to composite scores; and (3) in research, estimates of intercorrela- 
tions are needed for tests of altered length. 

A homogeneous alteration in test length is defined as one that
maintains the same ratios among the variances and covariances in
the common factors and the variances in the specific factors, if the
common factors are correlated; or the same ratios among the variances
in the common factors and specific factors if the common factors are
uncorrelated. In conjunction with this definition, four assumptions
are implicit: (1) that the test altered homogeneously in
length is one that would be factor-analyzed within the same, or nearly
the same, battery of tests as the original test of unit length in
order that the factorial composition would remain relatively invariant
(i.e., in geometric terms, the test vector would be found in the
same common-factor space); (2) that the samples to which the altered
test is administered would be from the same population as the
sample for which the test of unit length was initially factor-analyzed;
(3) that the samples would be very large in order that any variations
in factor pattern due to sampling error would be negligible; and (4)
that the amount of correlation between the factors remains constant,
an assumption which depends upon the fulfillment of assumptions
(1), (2), and (3).


*This article is based on a paper read by the authors at the annual meeting
of the Western Psychological Association in Eugene, Oregon, June 25, 1949.

We shall derive formulas by means of which, given the factor 
loadings and the internal reliability of a test of unit length, the fol- 
lowing estimates can be made: 

(1) the common-factor loadings for a similar test of length n; 

(2) the number of times (n) that a test needs to be length- 
ened in order to achieve a factor loading of a desired magnitude; and 

(3) the correlation between two tests, either or both of which 
have been altered in length, as a function of (a) the new factor load- 
ings in the altered tests; or (b) the original loadings in the unit- 
length tests. 

The following symbols will be defined: 


p, q = any two common factors;
j, k = any two tests of unit length;
$r_{jk}$ = the obtained coefficient of correlation between
    sets of scores on tests j and k;
$r_{pq}$ = the amount of correlation between any two
    factors p and q;
$R_j$ = the reduced correlation matrix of tests of unit
    length;
$R_p$ = the matrix of correlated (primary) factors;
F = the common-factor matrix for the battery of
    tests of unit length in which the entries in each
    row correspond to the factor loadings for a
    single test and in which the entries in each
    column represent the weights in one factor for
    each of the tests;
r = the number of common factors, the rank of $R_j$
    and of F;
t = the total number of tests in a battery;
$a_{jp}$ and $a_{kp}$ = the respective loadings of Factor p in tests j
    and k, each of unit length;
$a_{(nj)p}$ and $a_{(mk)p}$ = the estimated loadings of Factor p in tests j
    and k lengthened n and m times, respectively;
$h^2_j$ and $h^2_{nj}$ = the communalities, respectively, of Test j of
    unit length and of Test j lengthened n times;
$r_{jj}$ and $r_{(nj)(nj)}$ = the (internal) reliability of Test j before and
    after change in length;
$r_{(nj)(mk)}$ = the estimated correlation between scores on
    Test j lengthened n times and Test k lengthened
    m times.


By geometric argument, it is apparent that if a test vector is 
lengthened or shortened, the projections of this altered vector upon 
any set of reference axes in the common-factor space will be propor- 
tional to those projections of the initial test vector on the same axes 
on the assumption that the reference axes are fixed. These axes may 
be taken to refer to any set of orthogonal reference vectors, to the 
normals to the hyperplanes bounding the common-factor space, or to 
the primary vectors, each of which is formed by the intersection of 
a different set of r—1 hyperplanes. 

This geometric interpretation of the proportional changes in factor loadings arising from the homogeneous alteration of test length may be expressed algebraically as $a_{(nj)p} = c \cdot a_{jp}$, where $c$ is a constant of proportionality. In fact, the definition of a homogeneous change in test length may be formulated in algebraic terms as follows:

$$\frac{a_{(nj)p}^2}{a_{jp}^2} = \frac{h_{nj}^2}{h_j^2} = \frac{r_{(nj)(nj)}}{r_{jj}} = c^2. \tag{1}$$

The communality $h_j^2$ of the test of unit length is equal to

$$\sum_{p=1}^{r} a_{jp}^2 + 2\sum_{q=2}^{r}\sum_{p=1}^{q-1} a_{jp}\, r_{pq}\, a_{jq} \qquad (q > p).$$

Similarly, the communality $h_{nj}^2$ of the altered test is equal to

$$\sum_{p=1}^{r} a_{(nj)p}^2 + 2\sum_{q=2}^{r}\sum_{p=1}^{q-1} a_{(nj)p}\, r_{pq}\, a_{(nj)q} \qquad (q > p).$$

Substitution of $c \cdot a_{jp}$ and $c \cdot a_{jq}$ for $a_{(nj)p}$ and $a_{(nj)q}$ in the expression for the communality $h_{nj}^2$ will obviously reveal that the ratio of $h_{nj}^2$ to $h_j^2$ is equal to $c^2$, since each $r_{pq}$ is assumed to be fixed for any length of test. If the factors are uncorrelated such that each $r_{pq} = 0$, the product terms vanish. The ratio of $h_{nj}^2$ to $h_j^2$ is obviously still equal to $c^2$.










1. Estimating common-factor loadings for a similar
test of length n


The reliability $r_{(nj)(nj)}$ of a test lengthened $n$ times in a homogeneous manner may be estimated from an available coefficient of reliability $r_{jj}$ of an unaltered test through use of the familiar Spearman-Brown formula,

$$r_{(nj)(nj)} = \frac{n\, r_{jj}}{1 + (n-1)\, r_{jj}}.$$

Substitution of this fraction for $r_{(nj)(nj)}$ in the expression

$$a_{(nj)p}^2 = a_{jp}^2\, \frac{r_{(nj)(nj)}}{r_{jj}},$$

which is derived from (1), yields upon simplification the following formula:

$$a_{(nj)p} = a_{jp} \sqrt{\frac{n}{1 + (n-1)\, r_{jj}}}, \tag{2}$$

or simply,

$$a_{(nj)p} = d_j\, a_{jp}, \tag{2'}$$

where

$$d_j = \sqrt{\frac{n}{1 + (n-1)\, r_{jj}}},$$

a constant representing the ratio of the magnitude of the loading in a factor in a test of altered length to the magnitude of the loading in the same factor in a test of unit length.

Thus, if the loading in a factor is .50 in a test of an internal reliability of .76 and if such a test is doubled in length, the new factor loading $a_{(nj)p}$ is equal to

$$.50 \sqrt{\frac{2}{1 + (2-1)(.76)}}, \text{ or } .533.$$

The maximum loading in a factor that may be attained from lengthening a test is proportional to the square root of the reciprocal of the reliability. Thus, as $n$ approaches infinity in (2), the limiting value for a factor loading under the specified conditions is

$$a_{(\infty j)p} = \frac{a_{jp}}{\sqrt{r_{jj}}}. \tag{3}$$




This expression is equivalent to a correction for attenuation. For 
the illustration presented, the maximum factor weight to be realized 
by using a test of infinite length would be about .573. Clearly, the 
less reliable a test, the more can be gained in size of factor loading 
with homogeneous lengthening. This is readily apparent in Figure 1. 
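Formulas (2), (2') and (3) can be sketched in a few lines of Python. The function names below are illustrative and do not come from the article; the two calls reproduce the worked example (loading .50, reliability .76, test doubled in length).

```python
import math

def length_factor(n, r_jj):
    """Ratio d_j of Formula (2'): loading at n times unit length over unit-length loading."""
    return math.sqrt(n / (1.0 + (n - 1.0) * r_jj))

def lengthened_loading(a_jp, n, r_jj):
    """Formula (2): estimated loading after a homogeneous change of length by a factor n."""
    return a_jp * length_factor(n, r_jj)

def limiting_loading(a_jp, r_jj):
    """Formula (3): limit of Formula (2) as n grows without bound."""
    return a_jp / math.sqrt(r_jj)

doubled = lengthened_loading(0.50, 2, 0.76)
ceiling = limiting_loading(0.50, 0.76)
print(f"doubled test: {doubled:.4f}, limiting value: {ceiling:.4f}")
```

Values of n below 1 describe a shortened test and are handled by the same expression.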

If more than one test within a given battery are to be length- 
ened or shortened, it is convenient to express the computational pro- 
cedure in matrix notation. The matrix consisting of the factor load- 
ings for each of the tests of unit-length within a given battery has 
been previously defined as $F$. If several of the tests are changed in
length, while the remaining (unaltered) tests are retained within
the battery, the new matrix which gives the factor weights in both
the altered and unaltered tests may be denoted by $F_n$.

This new factor matrix $F_n$ may be readily found by premultiplying the original factor matrix $F$ by a diagonal matrix $D$ consisting of as many rows and columns as there are tests within each battery:

$$F_n = D F. \tag{4}$$


Each entry in $D$, which corresponds to a given test, is a number representing the ratio of a factor loading in a test lengthened a certain number of times to the loading in the same factor when the test is of unit length. Thus, in the case of a test $j$ altered in length $n$ times, the inserted entry in the $j$th row and the $j$th column of the diagonal matrix would be $d_j$, previously defined in conjunction with Formula (2). Similarly, corresponding to a test $k$ lengthened $m$ times the entry in $D$ would be designated $d_k$, where $d_k$ is equal to

$$\sqrt{\frac{m}{1 + (m-1)\, r_{kk}}}.$$

Obviously if the length of a test is not altered, unity is substituted for the entry in the diagonal matrix corresponding to the test, since its factor loadings remain unaltered.

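For example, if two tests $j$ and $k$ in a battery consisting of $t$ tests were to be altered in length $n$ and $m$ times respectively, the procedure indicated by Formula (4) for estimating the loadings in Factors $1, 2, \ldots, p, \ldots, r$ for Tests $1, 2, \ldots, j, k, \ldots, t$ may be clarified by writing each matrix in (4) in detail:

$$
\underbrace{\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1p} & \cdots & a_{1r} \\
a_{21} & a_{22} & \cdots & a_{2p} & \cdots & a_{2r} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{(nj)1} & a_{(nj)2} & \cdots & a_{(nj)p} & \cdots & a_{(nj)r} \\
a_{(mk)1} & a_{(mk)2} & \cdots & a_{(mk)p} & \cdots & a_{(mk)r} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{t1} & a_{t2} & \cdots & a_{tp} & \cdots & a_{tr}
\end{bmatrix}}_{F_n}
=
\underbrace{\begin{bmatrix}
1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & d_j & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & d_k & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 & 0 & \cdots & 1
\end{bmatrix}}_{D}
\underbrace{\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1p} & \cdots & a_{1r} \\
a_{21} & a_{22} & \cdots & a_{2p} & \cdots & a_{2r} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{j1} & a_{j2} & \cdots & a_{jp} & \cdots & a_{jr} \\
a_{k1} & a_{k2} & \cdots & a_{kp} & \cdots & a_{kr} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{t1} & a_{t2} & \cdots & a_{tp} & \cdots & a_{tr}
\end{bmatrix}}_{F}
$$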
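Because $D$ is diagonal, Formula (4) amounts to scaling each row of $F$ by its own constant. The sketch below uses plain Python lists; the three-test, two-factor battery and its reliabilities are hypothetical numbers chosen only for illustration.

```python
import math

def length_factor(n, r):
    """Diagonal entry of D for a test of reliability r changed in length n times."""
    return math.sqrt(n / (1.0 + (n - 1.0) * r))

def altered_factor_matrix(F, reliabilities, lengths):
    """Return F_n = D F; a length ratio of 1 leaves that test's row unchanged."""
    d = [length_factor(n, r) for n, r in zip(lengths, reliabilities)]
    return [[d_j * a for a in row] for d_j, row in zip(d, F)]

# Hypothetical battery: rows are tests, columns are factors.
F = [[0.50, 0.30],
     [0.60, 0.10],
     [0.20, 0.70]]
reliabilities = [0.76, 0.80, 0.70]
lengths = [2, 1, 0.5]   # test 1 doubled, test 2 unaltered, test 3 halved

F_n = altered_factor_matrix(F, reliabilities, lengths)
for row in F_n:
    print([round(a, 3) for a in row])
```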


In some instances, a computed coefficient of the internal reli- 
ability may be available only for the scores obtained from the length- 
ened or shortened test. It is therefore necessary to derive an esti- 
mate of the reliability of the test of unit length. Two approaches 
may be employed. The first is to consider the altered test as the one 
of unit length and to find the ratio of the length of the actual unit- 
length test to the test of changed length. Once this is done, the 
Spearman-Brown formula may be used to obtain the desired estimate. If the altered test is shorter than the test of unit length, the
value for $n$ in the Spearman-Brown formula is greater than unity;
if the altered test is longer than the test of actual unit length, the
value for $n$ is less than unity.

Somewhat more direct, the second approach leads to a formula which may be applied in a routine manner. If the Spearman-Brown formula,

$$r_{(nj)(nj)} = \frac{n\, r_{jj}}{1 + (n-1)\, r_{jj}},$$

is solved for $r_{jj}$ in terms of $n$ and $r_{(nj)(nj)}$, an estimate of the desired reliability may be made:

$$r_{jj} = \frac{r_{(nj)(nj)}}{n\,(1 - r_{(nj)(nj)}) + r_{(nj)(nj)}}.$$
















Substitution of this expression for $r_{jj}$ in (2) yields a method for estimating the factor loading in a test homogeneously changed in length when the factor loadings for the test of unit length are known and when the reliability of the altered test is available, but not the reliability of a test of unit length. Hence, (2) may be rewritten

$$a_{(nj)p} = a_{jp} \sqrt{\frac{n}{1 + (n-1)\left[\dfrac{r_{(nj)(nj)}}{n\,(1 - r_{(nj)(nj)}) + r_{(nj)(nj)}}\right]}}. \tag{5}$$








As an illustration, suppose that in a factor analysis of a test 
battery the loading in a factor for an experimental test is .60 and 
that no reliability coefficient is computed. Homogeneous items are 
added to this test until it is three times its former length. Follow- 
ing the administration of the longer form to the same or to a com- 
parable sample, an internal reliability coefficient of .88 is computed. 
What would be the approximate size of the factor loading in the 
lengthened test? 

The factor loading $a_{(nj)p}$ is equal to

$$.60 \sqrt{\frac{3}{1 + (3-1)\left[\dfrac{.88}{3\,(1 - .88) + .88}\right]}},$$

or about .67. Calculation of the value of the numerical expression within the brackets yields a reliability estimate of .709 for the test prior to its being lengthened.
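The two steps of this illustration, recovering the unit-length reliability and then applying Formula (2), can be sketched as follows. The function names are illustrative only.

```python
import math

def unit_reliability(r_altered, n):
    """Spearman-Brown formula solved for r_jj, given the reliability of the altered test."""
    return r_altered / (n * (1.0 - r_altered) + r_altered)

def loading_from_altered_reliability(a_jp, n, r_altered):
    """Formula (5): Formula (2) with r_jj replaced by the estimate above."""
    r_jj = unit_reliability(r_altered, n)
    return a_jp * math.sqrt(n / (1.0 + (n - 1.0) * r_jj))

r_jj = unit_reliability(0.88, 3)
a_new = loading_from_altered_reliability(0.60, 3, 0.88)
print(f"unit-length reliability: {r_jj:.4f}, lengthened loading: {a_new:.4f}")
```

For the figures in the text the quantity under the radical simplifies to 1.24, so the new loading is .60 times the square root of 1.24.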

An over-all view of the relative amount of change in an original common-factor loading resulting from lengthening a test homogeneously is shown numerically in Table 1 and graphically in Figure 1 for different reliability levels of the original test ($r_{jj}$) and for various ratios ($n$) of length of an altered test to that of an unaltered test. The particular range of values chosen for $r_{jj}$ and $n$ should suffice for most practical purposes.

In Table 1 the percentage that the new factor loading is of the initial one in a test of unit length affords a means for computing the desired loading through the mere multiplication of the original factor weight $a_{jp}$ by the proportion-of-change constant in Table 1 (actually 1/100 of percentage values presented) relevant to given values for $r_{jj}$ and $n$. Linear interpolation may be employed for obtaining estimates of the percentage of change corresponding to those values of $r_{jj}$ and $n$ intermediate to those presented in the table. Similarly, in Figure 1, graphical estimates of what percentage a new factor loading is of an original one may be made for points (geometric values of $r_{jj}$ and $n$) not lying on the curves presented. Inspection of Table 1 and Figure 1 reveals two facts: (1) the more a test is lengthened the greater is the relative amount of change in the magnitude of the initial factor loading, and (2) the relative degree of change in an original factor loading for a fixed amount of alteration in test length (constant $n$) is greater, the lower the magnitude of the reliability coefficient $r_{jj}$, a finding previously noted in conjunction with the maximum value for a factor loading given by Formula (3).

TABLE 1
Percentage that a Common-Factor Loading in a Test is of the Original Factor
Loading before the Test was Changed Homogeneously in Length n Times*

                      Initial Reliability ($r_{jj}$)
  n      .60     .65     .70     .75     .80     .85     .90     .95
 0.2    62.0    64.5    67.4    70.7    74.5    79.1    84.5    91.3
 0.4    79.1    81.0    83.0    85.3    87.7    90.4    93.3    96.4
 0.6    88.9    90.0    91.3    92.6    93.9    95.3    96.8    98.4
 0.8    95.3    95.9    96.4    97.0    97.6    98.2    98.8    99.4
 1.5   107.4   106.4   105.4   104.4   103.5   102.6   101.7   100.8
 2.0   111.8   110.1   108.5   106.9   105.4   104.0   102.6   101.3
 2.5   114.7   112.5   110.4   108.5   106.6   104.8   103.1   101.5
 3.0   116.8   114.2   111.8   109.5   107.4   105.4   103.5   101.7
 4.0   119.5   116.4   113.6   110.9   108.5   106.1   104.0   101.9
 5.0   121.3   117.8   114.7   111.8   109.1   106.6   104.3   102.1

*Where the percentage is given by $100 \sqrt{\dfrac{n}{1 + (n-1)\, r_{jj}}}$.
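The entries of Table 1 follow directly from the footnoted expression, so intermediate values need not be interpolated. A minimal sketch that regenerates the grid is given below; the function name and the print layout are illustrative choices, not part of the article.

```python
import math

def percent_of_original(n, r_jj):
    """100 * sqrt(n / (1 + (n - 1) r_jj)), the quantity tabulated in Table 1."""
    return 100.0 * math.sqrt(n / (1.0 + (n - 1.0) * r_jj))

reliabilities = [0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
length_ratios = [0.2, 0.4, 0.6, 0.8, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]

print("   n " + " ".join(f"{r:7.2f}" for r in reliabilities))
for n in length_ratios:
    cells = " ".join(f"{percent_of_original(n, r):7.1f}" for r in reliabilities)
    print(f"{n:4.1f} {cells}")
```

Any pair of values, for example n = 1.75 with a reliability of .72, can be passed to `percent_of_original` directly.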


2. Estimating the number of times a test needs to be lengthened to 
achieve a factor loading of a desired size 


Squaring both sides of the expression in (2) and solving explicitly for $n$ yields the formula

$$n = \frac{a_{(nj)p}^2\,(r_{jj} - 1)}{a_{(nj)p}^2\, r_{jj} - a_{jp}^2}. \tag{6}$$

Inspection of the formula reveals that for a psychologically meaningful solution in testing, $n$ must be positive. Since for values of $r_{jj}$ between 0 and 1 the numerator is negative, the denominator must also be negative to give a positive sign for $n$. Solution of the inequality $a_{(nj)p}^2\, r_{jj} - a_{jp}^2 < 0$ for the term $a_{(nj)p}^2$ yields the result that $a_{(nj)p}^2$ is less than $a_{jp}^2 / r_{jj}$, or that,

$$a_{(nj)p} < \frac{a_{jp}}{\sqrt{r_{jj}}} \qquad (0 < r_{jj} < 1).$$

This finding checks with Formula (3), which states that the maximum factor loading to be attained in a test of infinite length (perfect reliability) is $a_{jp} / \sqrt{r_{jj}}$.

Hence, in the use of Formula (6), it is recommended that the maximum factor loading that can be achieved be estimated first. However, if a factor loading is selected that is impossible of solution,

$$\left( a_{(nj)p} > \frac{a_{jp}}{\sqrt{r_{jj}}} \right),$$

the denominator will be positive; and the sign of $n$, negative.

Thus, if in a test of unit length $r_{jj} = .80$ and $a_{jp} = .50$, and it is desired to determine how many times the test must be lengthened to achieve a loading of .53, the value for $n$ is found to be

$$\frac{(.53)^2\,(.80 - 1.00)}{(.53)^2\,(.80) - (.50)^2}, \text{ or } 2.22.$$

However, if one desires to increase the factor loading to a point such that one-half the variance in the test is accounted for by the factor ($a_{(nj)p} = .707$), the solution will be impossible in psychological terms, as evidenced by a value of $-.67$ for $n$. The maximum attainable factor weight is

$$\frac{.50}{\sqrt{.80}}, \text{ or } .559.$$
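Formula (6) and the feasibility check recommended above can be combined in one small function. Returning `None` for an unattainable target is a design choice of this sketch, not something the article prescribes.

```python
def required_length(a_target, a_jp, r_jj):
    """Formula (6): length ratio n needed to move a loading from a_jp to a_target.

    Returns None when the denominator is not negative, i.e. when a_target is at
    or above the ceiling a_jp / sqrt(r_jj) of Formula (3).
    """
    denominator = a_target ** 2 * r_jj - a_jp ** 2
    if denominator >= 0:
        return None
    return a_target ** 2 * (r_jj - 1.0) / denominator

n_needed = required_length(0.53, 0.50, 0.80)
unattainable = required_length(0.707, 0.50, 0.80)
print(n_needed, unattainable)
```

With the figures of the text, the first call gives roughly 2.22 and the second reports that a loading of .707 cannot be reached, since the ceiling is about .559.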


3. Estimating the correlation between two tests either or both of 
which have been changed in length in terms of the com- 
mon-factor weights of the original tests or 
of the altered tests 


In test theory the correlation between two tests may be ex- 
pressed as a function of the weights in factors common to the tests 
and the correlations among the factors. The correlation between any 
two tests $j$, $k$ in a battery may be obtained from the following expression:

$$r_{jk} = \sum_{p=1}^{r} a_{jp}\, a_{kp} + \sum_{q=1}^{r}\sum_{p=1}^{r} a_{jp}\, r_{pq}\, a_{kq} \qquad (p \neq q), \tag{7}$$










where the number of terms in the first summation product is $r$, the number of common factors; and the number of terms in the second summation product containing the correlations between factors is $r(r - 1)$. Representing in summational notation Thurstone’s fundamental factor theorem (2) extended by Tucker to the general oblique case (3), Formula (7) may be written in matrix form as

$$R_j = F R_{pq} F', \tag{8}$$

in which the symbols have been previously defined.

If test $j$ is lengthened in a homogeneous fashion, $n$ times, and test $k$, $m$ times, the correlation $r_{(nj)(mk)}$ between the two altered forms may be written through use of (7) as a function of either the new factor loadings in the altered tests or the original loadings in the unit-length tests. In terms of the new factor loadings the correlation is readily seen to be

$$r_{(nj)(mk)} = \sum_{p=1}^{r} a_{(nj)p}\, a_{(mk)p} + \sum_{q=1}^{r}\sum_{p=1}^{r} a_{(nj)p}\, r_{pq}\, a_{(mk)q} \qquad (p \neq q). \tag{9}$$

Substitution of $d_j a_{jp}$ for $a_{(nj)p}$ and of $d_k a_{kq}$ for $a_{(mk)q}$ in (9) yields a formula for the correlation between the two altered tests expressed in terms of the initial factor loadings in the respective tests of unit length:

$$r_{(nj)(mk)} = \sum_{p=1}^{r} d_j a_{jp}\, d_k a_{kp} + \sum_{q=1}^{r}\sum_{p=1}^{r} d_j a_{jp}\, r_{pq}\, d_k a_{kq} \qquad (p \neq q), \tag{10}$$

or, more simply,

$$r_{(nj)(mk)} = d_j d_k \left[\, \sum_{p=1}^{r} a_{jp}\, a_{kp} + \sum_{q=1}^{r}\sum_{p=1}^{r} a_{jp}\, r_{pq}\, a_{kq} \right] \qquad (p \neq q). \tag{11}$$


If the factors are orthogonal (all $r_{pq} = 0$), Formulas (9), (10), and (11) reduce to

$$r_{(nj)(mk)} = \sum_{p=1}^{r} a_{(nj)p}\, a_{(mk)p}, \tag{9'}$$

$$r_{(nj)(mk)} = \sum_{p=1}^{r} d_j a_{jp}\, d_k a_{kp}, \tag{10'}$$

or

$$r_{(nj)(mk)} = d_j d_k \sum_{p=1}^{r} a_{jp}\, a_{kp}. \tag{11'}$$

If only one of the tests is lengthened, say $j$, then the value for $m$ associated with $k$ is unity. The value for $d_k$ is also unity.
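Formula (11), with (11') as its orthogonal special case, can be sketched as a double loop over factors. The loadings and factor correlations below are hypothetical and serve only to exercise the function.

```python
import math

def length_factor(n, r_jj):
    """Constant d of Formula (2') for a test of reliability r_jj changed in length n times."""
    return math.sqrt(n / (1.0 + (n - 1.0) * r_jj))

def altered_correlation(a_j, a_k, r_pq, d_j=1.0, d_k=1.0):
    """Formula (11): d_j d_k times the unit-length correlation of Formula (7).

    a_j, a_k : unit-length loadings of tests j and k on the r common factors
    r_pq     : r x r matrix of factor correlations (off-diagonal entries are used)
    """
    total = 0.0
    for p, a_jp in enumerate(a_j):
        for q, a_kq in enumerate(a_k):
            weight = 1.0 if p == q else r_pq[p][q]
            total += a_jp * weight * a_kq
    return d_j * d_k * total

# Hypothetical two-factor example.
a_j, a_k = [0.50, 0.30], [0.40, 0.60]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
oblique = [[1.0, 0.5], [0.5, 1.0]]

r_unit = altered_correlation(a_j, a_k, orthogonal)
r_long = altered_correlation(a_j, a_k, oblique, d_j=length_factor(2, 0.76))
print(round(r_unit, 4), round(r_long, 4))
```

Leaving `d_k` at its default of 1 corresponds to the case in which only test j is lengthened.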










If estimates of the intercorrelations among several tests within 
a battery that have been lengthened or shortened are desired in terms 
of the common-factor weights of the altered tests or of the original 
tests, it is expeditious to indicate, as in the first section, the operations involved in matrix notation. If the symbol $R_{nj}$ is used to designate the new matrix of intercorrelations of tests within a battery subsequent to changing the lengths of one or more tests, Formulas (9) and (11) may be expressed in matrix form:

$$R_{nj} = F_n R_{pq} F_n' \tag{12}$$

and

$$R_{nj} = D F R_{pq} F' D. \tag{13}$$


If the factors are orthogonal ($R_{pq} = I$, the identity matrix), Formulas (12) and (13) reduce to

$$R_{nj} = F_n F_n' \tag{12'}$$

and

$$R_{nj} = D F F' D. \tag{13'}$$


In practice, estimates of correlations of test scores obtained 
through use of these formulas may be somewhat low in view of 
the fact that in a given analysis not all common factors may be 
isolated. To the extent that not all common factors are known and 
included within Formulas (9), (10), and (11), underestimates are 
likely to occur. In general, however, the magnitude of these estimated coefficients should check fairly closely with the values obtained from the following formula:

$$r_{(nj)(mk)} = \frac{\sqrt{r_{(nj)(nj)}\, r_{(mk)(mk)}}}{\sqrt{r_{jj}\, r_{kk}}}\; r_{jk}, \tag{14}$$

in which the meaning of each symbol has been previously defined.
It may be readily shown algebraically that Formula (14) is equivalent to Kelley’s Formula 101 (1, 396), which in terms of the notation employed in this article may be expressed as:

$$r_{(nj)(mk)} = \frac{n\, m\, r_{jk}}{\sqrt{n + n(n-1)\, r_{jj}}\; \sqrt{m + m(m-1)\, r_{kk}}}. \tag{15}$$





Formula (15) may be rewritten readily as

$$r_{(nj)(mk)} = r_{jk} \sqrt{\frac{n}{1 + (n-1)\, r_{jj}}}\; \sqrt{\frac{m}{1 + (m-1)\, r_{kk}}}, \tag{16}$$

which in terms of the notation previously employed in Formulas (2'), (10), and (11) may be denoted simply as

$$r_{(nj)(mk)} = d_j\, d_k\, r_{jk}. \tag{17}$$


This formula is equivalent to (11) if all common factors have 
been isolated and if the usual assumption of no correlation between 
the three sources of test variance (common-factor, specific, and error 
variance) is fulfilled. 
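The stated equivalence of Formulas (14), (15), and (17) is easy to check numerically. The sketch below evaluates all three for one set of hypothetical inputs; the function names are illustrative.

```python
import math

def spearman_brown(r, n):
    """Reliability of a test of reliability r after a homogeneous change of length n."""
    return n * r / (1.0 + (n - 1.0) * r)

def formula_14(r_jk, n, m, r_jj, r_kk):
    """Correlation of the altered tests from the reliabilities before and after the change."""
    r_nn, r_mm = spearman_brown(r_jj, n), spearman_brown(r_kk, m)
    return math.sqrt(r_nn * r_mm) / math.sqrt(r_jj * r_kk) * r_jk

def formula_15(r_jk, n, m, r_jj, r_kk):
    """Kelley's form as written in Formula (15)."""
    return n * m * r_jk / (math.sqrt(n + n * (n - 1) * r_jj)
                           * math.sqrt(m + m * (m - 1) * r_kk))

def formula_17(r_jk, n, m, r_jj, r_kk):
    """The d_j d_k r_jk form of Formula (17)."""
    d_j = math.sqrt(n / (1.0 + (n - 1.0) * r_jj))
    d_k = math.sqrt(m / (1.0 + (m - 1.0) * r_kk))
    return d_j * d_k * r_jk

args = (0.40, 2, 3, 0.76, 0.80)   # hypothetical r_jk, n, m, r_jj, r_kk
values = [f(*args) for f in (formula_14, formula_15, formula_17)]
print([round(v, 6) for v in values])
```

The three printed values agree to rounding error, which is the algebraic identity the text describes.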

It may be mentioned that Formulas (11), (14), (15), (16), and (17) are applicable for estimating the correlation between a test $j$ lengthened $n$ times and a criterion measure $k$. If the criterion is of unit length, $m$ is of course unity, and the formulas simplify considerably.

In practice it is obviously advantageous to employ Kelley’s formulas for estimating the correlation between two tests altered in length or the correlation between a lengthened test and a criterion measure. It is interesting to note, however, that Kelley’s formulas are equivalent to those obtained when changes in the factor pattern of homogeneously altered tests are considered.


REFERENCES 
1. Kelley, Truman L. Fundamentals of statistics. Cambridge: Harvard Univ. 
Press, 1947. 
2. Thurstone, L. L. Multiple-factor analysis. Chicago: Univ. Chicago Press, 
1947. 
3. Tucker, Ledyard. The role of correlated factors in factor analysis. Psycho- 
metrika, 1940, 5, 141-152. 


Manuscript received 6/27/49 
Revised manuscript received 11/21/49 








[Figure 1. Vertical axis: percentage new factor loading is of initial one.]



FIGURE 1 
Percentage that a Common-Factor Loading in a Test of (Homogeneously) 
Altered Length Is of the Original Factor Loading in a Test of Unit Length for 
Various Levels of Reliability. 














PSYCHOMETRIKA— VOL. 15, NO. 3 
SEPTEMBER, 1950 


A TEST OF THE EQUALITY OF STANDARD 
ERRORS OF MEASUREMENT* 


BERT F. GREEN, JR.
EDUCATIONAL TESTING SERVICE 


A procedure is proposed for testing the significance of group 
differences in the standard error of measurement of a psychological 
test. Wilks’ criterion is used to assure that the tests used in ascer- 
taining reliability and hence variance of errors of measurement 
may be assumed parallel for each group. Votaw’s criterion may be 
used to check whether the test scores of all the groups have the 
same mean, variance, and covariance. It is possible, however, for 
the variance and reliability of the test to differ widely from group 
to group, so that Votaw’s criterion is not satisfied even though the 
variance of errors of measurement stays relatively constant. For 
this case a modification of Neyman and Pearson’s criterion is de- 
veloped to test agreement among standard errors of measurement 
despite group differences in mean, variance, and reliability of the 
test. 


When a psychological test has been given to more than one 
group of individuals, it has a standard error of measurement for 
each group. A criterion is needed to determine whether the ob- 
served differences of the standard errors are significant, or whether 
they could reasonably be ascribed to the chance fluctuations of a 
single parameter. The proposed criterion is an application of the 
maximum likelihood criterion of Neyman and Pearson (4) for test- 
ing the homogeneity of variances. (See also Bartlett (1).) 

It has usually been assumed that the standard error of measure- 
ment would not be affected by a change in group heterogeneity. This 
has been used by Kelley (3), Otis (5), and others to demonstrate 
the relationship between changes in variance and changes in reli- 
ability coefficient. The test discussed below will provide the basis 
for an experimental check, in any particular case, of the assump- 
tion that the standard errors of measurement are equal. 

The following notation and definitions have been used. 


$g$ = subscript designating group $g$,
$k$ = number of groups,
$i$ = subscript designating individual $i$,


*The author wishes to acknowledge the helpful criticisms of Dr. Harold 
Gulliksen, who suggested the problem. 












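$n_g$ = number of individuals in group $g$,

$$N = \sum_{g=1}^{k} n_g.$$

For each group we define:

$x_i$ = score on test $x$ of individual $i$,
$y_i$ = score on test $y$ of individual $i$,
$u_i = x_i - y_i$,

$$\bar{x} = \frac{1}{n_g}\sum_{i=1}^{n_g} x_i, \qquad \bar{y} = \frac{1}{n_g}\sum_{i=1}^{n_g} y_i,$$

$$s_x^2 = \frac{1}{n_g}\sum_{i=1}^{n_g} (x_i - \bar{x})^2, \qquad s_y^2 = \frac{1}{n_g}\sum_{i=1}^{n_g} (y_i - \bar{y})^2,$$

$$c_{xy} = \frac{1}{n_g}\sum_{i=1}^{n_g} (x_i - \bar{x})(y_i - \bar{y}),$$

$$s_w^2 = s_x^2 + s_y^2 + 2 c_{xy},$$

$$t = s_x^2 + s_y^2 - 2 c_{xy} + (\bar{x} - \bar{y})^2 = \frac{1}{n_g}\sum_{i=1}^{n_g} u_i^2.$$

For the combination of the $k$ groups under discussion, we define:

$$T = \frac{1}{N}\sum_{g=1}^{k} n_g t_g,$$

$$A = \frac{1}{N}\sum_{g=1}^{k} n_g \left[ s_{w_g}^2 + (\bar{x}_g + \bar{y}_g)^2 \right],$$

$$B = \frac{1}{N}\sum_{g=1}^{k} n_g (\bar{x}_g + \bar{y}_g).$$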

The variance of the errors of measurement will be referred to 
in this article as error variance. No ambiguity can arise as we are 
not dealing with any other type of error in the present article. This 
error variance is usually defined as 


$$e^2 = S^2 (1 - R), \tag{1}$$


where S? is the variance of the test and R is the reliability co- 
efficient. Since this definition involves hypothetical tests and true 
scores, and since it is difficult to deal with statistically, an alterna- 
tive definition will be used, based on more immediately available in- 
formation. 

It has been observed that under certain assumptions, the variance of the difference scores, $u_i$, is twice the error variance. Jackson and Ferguson (2) point this out and then consider certain applications of the analysis of variance to this problem. This equivalence is the basis of the procedure discussed below. It is assumed that for each group, the difference scores $u_i$ form a sample from a normal distribution with mean 0 and variance $\sigma_g^2$. The assumption of zero mean follows from the requirement that the tests be parallel. The maximum-likelihood estimator of $\sigma_g^2$ is $t_g$. For the case where the error variance is to be estimated from two halves of a test, the error variance of the scores $(x_i + y_i)$ for group $g$ is defined as

$$e_g^2 = t_g. \tag{2}$$

In the case where parallel forms are used, the error variance of the scores $x_i$ for group $g$ is defined as

$$e_g^2 = \tfrac{1}{2} t_g. \tag{3}$$

It will be seen that (2) and (3) reduce to (1) when $s_x^2 = s_y^2$ and $\bar{x} = \bar{y}$, i.e., when the definition of parallel tests is exactly fulfilled. Otherwise, (2) and (3) are slightly larger than (1) and thus larger than the error variance as it is usually defined.

In order to compute the error variance, parallel tests or parallel half-tests are necessary. The statistical criteria developed by Wilks (7) should be used to determine whether the tests may be assumed parallel. His hypothesis $H_{mvc}$, that the means are equal, the variances are equal, and (for more than two tests) the covariances are equal, is equivalent to the hypothesis that the tests are parallel. His corresponding criterion $L_{mvc}$ should be applied to tests $x$ and $y$ before proceeding to the tests for homogeneity of variances. For the case of two tests, for each group $g$, $H_{mvc}$ is the hypothesis that $\bar{x} = \bar{y}$ $(= m)$ and $s_x^2 = s_y^2$ $(= v)$. This hypothesis is tested by the criterion $L_{mvc}$, which in this case reduces to

$$L_{mvc} = \frac{s_x^2 s_y^2 - c_{xy}^2}{\tfrac{1}{4}\, t\, s_w^2}.$$

For large $n_g$, $G = -n_g \log_e L_{mvc}$ has a chi-square distribution, with two degrees of freedom.

Since the parallel nature of the tests depends upon the group being tested, the above criterion should be applied to tests $x$ and $y$ for each of the $k$ groups. If $H_{mvc}$ is rejected for any of the groups, nothing more can be done with these groups until parallel tests are available. For the groups for which $H_{mvc}$ is accepted, the criterion discussed below can be used to test the equality of the error variances.
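A minimal sketch of the two-test form of Wilks’ criterion for a single group follows. It assumes the reduced form in which the criterion is the ratio of $s_x^2 s_y^2 - c_{xy}^2$ to one-fourth of $t\,s_w^2$; the function name is illustrative, and the inputs are the Group III summary statistics reported in the worked example of the paper.

```python
import math

def wilks_parallel_test(n_g, mean_x, mean_y, var_x, var_y, cov_xy):
    """Return (L, G, P) for the hypothesis that tests x and y are parallel in one group.

    Variances and covariance use divisor n_g, as in the definitions of the paper.
    P is the upper-tail chi-square probability for 2 degrees of freedom, exp(-G / 2).
    """
    t = var_x + var_y - 2.0 * cov_xy + (mean_x - mean_y) ** 2
    s_w2 = var_x + var_y + 2.0 * cov_xy
    L = (var_x * var_y - cov_xy ** 2) / (0.25 * t * s_w2)
    G = -n_g * math.log(L)
    return L, G, math.exp(-G / 2.0)

L, G, P = wilks_parallel_test(89, 9.7416, 9.5955, 13.7422, 9.0049, 4.8505)
print(f"L = {L:.4f}, G = {G:.3f}, P = {P:.3f}")
```

The closed-form tail probability holds only for two degrees of freedom; small differences from hand-computed values are to be expected from rounding in logarithm tables.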










For those groups for which we accept the hypothesis that the tests are parallel, we may consider the further hypothesis that the parallel test scores of the $k$ groups are all samples from the same normal multivariate distribution. For the case of two tests and $k$ groups, the hypothesis is that

$$m_1 = m_2 = \cdots = m_g = \cdots = m_k,$$
$$v_1 = v_2 = \cdots = v_g = \cdots = v_k,$$
$$c_{xy_1} = c_{xy_2} = \cdots = c_{xy_g} = \cdots = c_{xy_k}.$$

Votaw (6) has developed a criterion for this hypothesis which in this case reduces to

$$L'_{MVC/mvc} = \frac{\prod_{g=1}^{k} \left( \tfrac{1}{4}\, t_g\, s_{w_g}^2 \right)^{n_g}}{\left[ \tfrac{1}{4}\, T (A - B^2) \right]^{N}}.$$

For large $n_g$, $G' = -\log_e L'_{MVC/mvc}$ has the chi-square distribution with $3(k-1)$ degrees of freedom.
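The statistic $G'$ can be sketched from summary quantities alone. The function below assumes the reduced form in which each group contributes $n_g$ times the logarithm of the ratio of $\tfrac14 t_g s_w^2$ to $\tfrac14 T(A - B^2)$; that reading is an assumption of this sketch, checked only against the figures of the paper’s illustration (group sizes 45, 106, 89).

```python
import math

def votaw_g(ns, quarter_t_sw2, T, A, B):
    """Return (G', df) for k groups.

    ns            : group sizes n_g
    quarter_t_sw2 : per-group values of (1/4) t_g s_w^2
    T, A, B       : pooled quantities as defined for the combination of groups
    """
    pooled = 0.25 * T * (A - B ** 2)
    g_prime = -sum(n * math.log(q / pooled) for n, q in zip(ns, quarter_t_sw2))
    return g_prime, 3 * (len(ns) - 1)

g_prime, df = votaw_g([45, 106, 89],
                      [133.0367, 101.4778, 106.0027],
                      T=11.8500, A=523.3489, B=21.9250)
print(f"G' = {g_prime:.3f} with {df} degrees of freedom")
```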

A logical second step in checking the error of measurement is to 
apply Votaw’s test. If this hypothesis is accepted, we may conclude 
that the parallel tests probably have the same mean, variance, and 
covariance for all the groups; in which case it is reasonable to as- 
sume that the test in question has the same error variance for all 
the groups. However, it is quite possible for the groups to have the 
same error variance without being samples from the same bivariate 
distribution. This is often the case when the groups differ widely in 
average ability or range of ability. The test might measure all the 
groups with about the same accuracy, yet the means, variances, and 
reliabilities of the test scores could differ widely from group to group. 
In such cases a less restricting criterion is needed. 

By considering the distributions of the difference scores, $u_i$, we may develop a criterion for testing the hypothesis of equal error variances. It should be noted that this criterion is valid in itself for testing the homogeneity of variances $\sigma_g^2$. These variances may be considered due to errors of measurement, i.e., error variances, only if the parallel nature of the tests is established. Under the assumption that for each group the $u_i$ values are normally distributed with zero mean and variance $\sigma_g^2$, we wish to test the hypothesis that

$$\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_g^2 = \cdots = \sigma_k^2.$$

In the parallel form case we wish to test the equivalent hypothesis

$$\tfrac{1}{2}\sigma_1^2 = \tfrac{1}{2}\sigma_2^2 = \cdots = \tfrac{1}{2}\sigma_g^2 = \cdots = \tfrac{1}{2}\sigma_k^2.$$














The maximum-likelihood ratio for testing this hypothesis is

$$\lambda = \prod_{g=1}^{k} \left( \frac{t_g}{T} \right)^{n_g/2}.$$

For large values of $n_g$ (say $n_g > 50$),

$$G'' = -\sum_{g=1}^{k} n_g \log_e \frac{t_g}{T}$$

has the chi-square distribution with $k-1$ degrees of freedom. The hypothesis is rejected if the value of $G''$ is greater than the value of chi-square with $k-1$ degrees of freedom at some confidence level, as 5% or 1%.

It is important to note that the criterion is applicable only to in- 
dependent groups. If there are any overlapping groups, these groups 
must be redefined so that no individual score or pair of scores is in 
more than one group. 
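The test of equal error variances needs only the group sizes and the per-group values of $t_g$. The sketch below is a minimal version; the function name is illustrative, and the example values (group sizes 45, 106, 89 and $t_g$ of 11.2444, 11.0849, 13.0674) are those of the illustration that follows in the paper.

```python
import math

def equal_error_variance_test(ns, ts):
    """Return (T, G'', df) for independent groups.

    ns : group sizes n_g
    ts : per-group mean squares t_g of the difference scores u_i about zero
    """
    N = sum(ns)
    T = sum(n * t for n, t in zip(ns, ts)) / N
    g2 = -sum(n * math.log(t / T) for n, t in zip(ns, ts))
    return T, g2, len(ns) - 1

T, g2, df = equal_error_variance_test([45, 106, 89], [11.2444, 11.0849, 13.0674])
p_two_df = math.exp(-g2 / 2.0)   # chi-square upper tail, valid for df = 2 only
print(f"T = {T:.4f}, G'' = {g2:.4f}, df = {df}, P = {p_two_df:.3f}")
```

For other numbers of groups the tail probability would have to come from a chi-square table or a statistics library; the closed form used here is specific to two degrees of freedom.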

To illustrate the application of the criterion, a 40-item test of 
reading comprehension was used. This test had been given experi- 
mentally to three different groups, using separately timed 20-item 
halves to obtain parallel half reliability.* 


From these data the following values were calculated.

Basic data, N = 240

                Group I    Group II   Group III
$n_g$                45         106          89
$\bar{x}$       13.2889     11.3679      9.7416
$\bar{y}$       12.6222     11.0377      9.5955
$s_x^2$         15.7610     13.7608     13.7422
$s_y^2$         13.3017     10.0363      9.0049
$c_{xy}$         9.1314      6.4106      4.8505

From these figures, further values were computed.

                Group I    Group II   Group III
$r$               .6306       .5455       .4360
$R$               .7735       .7059       .6072
$s_w^2$         47.3255     36.6183     32.4481
$s_w^2(1-R)$    10.7192     10.7694     12.7456
$t$             11.2444     11.0849     13.0674

Here $r$ is the correlation between the two half tests, and $R$ is this correlation corrected by the Spearman-Brown formula. Note that for the total test the reliabilities, $R$, differ considerably, as do the variances, $s_w^2$; but that the error variances $s_w^2(1-R)$ or $t$ are quite similar from group to group. Furthermore, the differences between $s_w^2(1-R)$ and $t$ are slight, indicating that the use of $t$ as the error variance is reasonable.

*The data were supplied by Dr. John W. French of the Educational Testing Service.

At this point Wilks’ test was applied to each group.

                              Group I         Group II        Group III
$s_x^2 s_y^2 - c_{xy}^2$     126.2656          97.0117         100.2198
$\tfrac{1}{4} t\, s_w^2$     133.0367         101.4778         106.0027
$L_{mvc}$                       .9491            .9560            .9454
$G$                            2.3535           4.7700           5.0018
$P$                      .50 > P > .30    .10 > P > .05    .10 > P > .05

where $P$ is the probability of the chance occurrence of a chi-square of size $G$ with 2 degrees of freedom. Since in each case $P$ is greater than .05, we may assume that the tests are parallel for each group. Votaw’s test was next applied.


                                                 Group I    Group II   Group III
$s_w^2 + (\bar{x} + \bar{y})^2$                 718.7106    538.6292    406.3715
$\tfrac14 t\, s_w^2 \,/\, \tfrac14 T(A - B^2)$    1.0531       .8033       .8391

$A = 523.3489$
$B = 21.9250$
$T = 11.8500$
$G' = 36.4981$
$P < .01$

Here, $P$ is the probability of a chance occurrence of a chi-square value of size $G'$ with 6 degrees of freedom. We may conclude that the three samples are not all from the same population of test scores. Finally, the test for the equality of error variances was applied.


            Group I    Group II   Group III
$t/T$         .9489       .9354      1.1027

$G'' = .7391$
$.70 > P > .50$

Here, $P$ refers to the chi-square distribution with 2 degrees of freedom. We may conclude that the test probably has the same standard error of measurement for the three groups. Thus we have shown that the half-tests are parallel in each group, that the three groups are not samples from the same bivariate distribution of half-test scores, but that the standard errors of measurement are the same for all three groups.

The procedure outlined above is a step-by-step method for test- 
ing the homogeneity of error variances. Wilks’ test is used to assure 
that the tests are parallel. Votaw’s test is used, if the tests are par- 
allel, to check whether the groups might all be samples from the same 
bivariate distribution, i.e., have the same mean, variance, and covari- 
ance. A modification of Neyman and Pearson’s test is used, if the 
tests are parallel, to ascertain whether the error variances are the 
same within the limits of sampling error. Votaw’s test is discussed 
to emphasize the fact that the groups may differ widely in mean, 
variance, and reliability coefficient on the test, but that the standard 
errors of measurement might be the same for all the groups. 


REFERENCES 

1. Bartlett, M. S. Properties of sufficiency and statistical tests. Proc. Roy. Soc., 
Series A, 1937, 160, 268-282. 

2. Jackson, R. W. B., and Ferguson, G. A. Studies on the reliability of tests.
Bulletin No. 12, Department of Educational Research, University of Toronto,
Toronto, Canada.

3. Kelley, T. L. The reliability of test scores. J. educ. Res., 1921, 3, 370-379.

4. Neyman, J., and Pearson, E. S. On the problem of k samples. Bulletin de
l’Académie Polonaise des Sciences et des Lettres, Série A, Sciences Mathématiques,
1931, 460-481.

5. Otis, A. S. A method of inferring the change in a coefficient of correlation 
resulting from a change in the heterogeneity of the group. J. educ. Psych., 
1922, 13, 293-294. 

6. Votaw, D. F., Jr. Testing compound symmetry in a normal multivariate dis- 
tribution. Ann. math. Stat., 1948, 19, 447-473. Also Ph.D. Thesis, Princeton 
University. 

7. Wilks, S. S. Sample criteria for testing equality of means, equality of vari- 
ances and equality of covariances in a normal multivariate distribution. 
Ann. math. Stat., 1946, 17, 257-281. 

8. Wilks, S. S. Mathematical Statistics. Princeton: Princeton University Press,
1943.









PSYCHOMETRIKA—VOL. 15, NO. 3 
SEPTEMBER, 1950 


THE RELIABILITY OF SPEEDED TESTS 


HAROLD GULLIKSEN* 
EDUCATIONAL TESTING SERVICE 


Some methods are presented for estimating the reliability of a
partially speeded test without the use of a parallel form. The effect
of these formulas on some test data is illustrated. Whenever an
odd-even reliability is computed it is probably desirable to use one
of the formulas noted in Section 2 of this paper in addition to the
usual Spearman-Brown correction. Since the formulas given here
involve the mean and the standard deviation of the “number unattempted
score,” a method is given in Section 4 for computing this
mean and standard deviation from item analysis data. If the item
analysis data are available, this method will save considerable time
as compared with rescoring answer sheets.


1. The Reliability of a Pure Speed Test


In the case of a pure speed test it is generally agreed that a good estimate of reliability can be obtained only by administering two parallel forms. However, since it has been possible to obtain a useful lower bound for the reliability of a power test from only the mean, the standard deviation, and the number of items, it may be worth while to investigate similar possibilities for a pure speed test.

It has been shown (1), (2), (3) that a useful lower bound for the reliability of a power test is given by

$$R_{xx} = \frac{K}{K-1}\left[ 1 - \frac{\sum_{g=1}^{K} s_g^2}{S_x^2} \right], \tag{1}$$

where

$R_{xx}$ = the test reliability,
$K$ = number of items,
$s_g^2 = (p_g - p_g^2) = p_g q_g$ = variance of item $g$, where $p_g$ = proportion answering item $g$ correctly [note that $\sum s_g^2 = \sum p_g q_g$],
$S_x^2$ = test variance.


*The author wishes to thank Dr. Robert L. Thorndike, Dr. A. Paul Horst,
and Dr. William G. Mollenkopf for reading the manuscript and making valuable
suggestions regarding revision.











For any test the variance of the errors of measurement is given by

$$E_X^2 = S_X^2 (1 - R_{XX}). \qquad (2)$$

From (1) and (2) we have

$$E_X^2 \le S_X^2 - \frac{K}{K-1}\left[ S_X^2 - \sum s_g^2 \right] \qquad (3)$$

or an equivalent form,

$$E_X^2 \le \frac{K \sum s_g^2}{K-1} - \frac{1}{K-1}\, S_X^2. \qquad (4)$$

Since Equation (1) gives a lower bound for the reliability,
Equation (3) or its alternate form (4) is an upper bound for the
variance of the errors of measurement. For a power test the term
$\sum s_g^2$ can be computed as a function of the item difficulties. That is,
for items which are scored 1 or 0 the item variance is given by

$$s_g^2 = p_g - p_g^2. \qquad (5)$$

Summing, we have

$$\sum s_g^2 = \sum p_g - \sum p_g^2. \qquad (6)$$

If we make use of the fact that the mean is equal to the sum of the
item difficulties, we have

$$M_X = \sum p_g. \qquad (7)$$

Using the variance of the distribution of the item difficulties ($s_p^2$) to
obtain an expression for $\sum p_g^2$, we have

$$s_p^2 = \frac{\sum p_g^2}{K} - \frac{M_X^2}{K^2} \qquad (8)$$

or

$$\sum p_g^2 = K s_p^2 + \frac{M_X^2}{K}. \qquad (9)$$

Substituting from (7) and (9) in (6) gives

$$\sum s_g^2 = M_X - \frac{M_X^2}{K} - K s_p^2, \qquad (10)$$


or an alternate form 























If the items do not vary greatly in difficulty, we may set $s_p^2$
equal to 0 in Equation (10) and substitute the result in Equation
(1) to obtain an estimate of reliability. The resulting equation gives
the reliability of a power test as a function of the mean, standard
deviation, and number of items.

The line of reasoning indicated above may readily be applied
to a power test which has a fixed number of items so that $K$ can be
readily determined. For a speed test it is difficult to find an appro-
priate value for $K$. In the case of a speed test with more items than
can possibly be answered by any one subject it might be argued that
$K$ can be taken as equal to the number of items answered by the per-
son answering the largest number of items. However, in this case
one of the quantities entering into the equation for reliability would
be determined from the behavior of only one person. This seems
unsatisfactory. It is possible, however, to write a lower bound for
the reliability purely from the mean and standard deviation of the
test. From Equation (10) $M_X$ is larger than $\sum s_g^2$. Substituting $M_X$
for $\sum s_g^2$ and setting $K/(K-1)$ equal to unity in Equation (1), we have

$$R_{XX} > 1 - \frac{M_X}{S_X^2}, \qquad (12)$$

and, using this inequality in Equation (2), we find

$$E_X^2 < M_X. \qquad (13)$$

If it is possible to devise a reasonable estimate for the value of $K$,
we may omit only the last term of Equation (10) and substitute in
Equation (1), giving

$$R_{XX} \ge \frac{K}{K-1}\left[ 1 - \frac{K M_X - M_X^2}{K S_X^2} \right], \qquad (14)$$

and from Equation (4)

$$E_X^2 \le \frac{K M_X - M_X^2}{K-1} - \frac{1}{K-1}\, S_X^2. \qquad (15)$$


It may be that these Equations, (12) to (15), will in some cases
be useful for getting estimates of the reliability of a pure speed test.
However, since the corresponding equations for the power case have
generally given fairly low estimates of reliability, it is doubtful that
these estimates will be useful. Nevertheless, such methods may be
very useful in estimating the reliability of a power test which is
slightly speeded (as most of our power tests are in order to make
them practicable for use with large groups). Let us now turn to
this problem.
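The three lower bounds of this section can be compared numerically. The following is a minimal sketch in Python, not part of the original paper; the function names and the example figures (ten items of difficulty .5 and a test variance of 6.0) are assumptions made purely for illustration.

```python
def kr20_lower_bound(item_p, test_var):
    """Equation (1): lower bound from item difficulties and test variance."""
    k = len(item_p)
    sum_item_var = sum(p * (1.0 - p) for p in item_p)  # sum of p_g * q_g
    return k / (k - 1) * (1.0 - sum_item_var / test_var)


def bound_from_mean(mean, test_var):
    """Equation (12): uses only the mean and the variance; no K is needed."""
    return 1.0 - mean / test_var


def bound_with_k(mean, test_var, k):
    """Equation (14): drops only the K * s_p^2 term of Equation (10)."""
    return k / (k - 1) * (1.0 - (k * mean - mean ** 2) / (k * test_var))


# Made-up illustration: 10 items, each answered correctly by half the group.
item_p = [0.5] * 10
mean_x = sum(item_p)   # Equation (7): the mean is the sum of the difficulties
var_x = 6.0            # hypothetical test variance

print(kr20_lower_bound(item_p, var_x))
print(bound_with_k(mean_x, var_x, len(item_p)))
print(bound_from_mean(mean_x, var_x))
```

When all items have the same difficulty, $s_p^2 = 0$ and the bounds (1) and (14) coincide; the bound (12) is lower because it also gives up the factor $K/(K-1)$ and the term $M_X^2/K$.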


2. The Reliability of a Power Test Which Is Slightly Speeded 
In general, where one has several sources of error in a test the
total variance of the errors of measurement is equal to the sum of
the variances of the part errors (1). Specifically, we may write

$$Z = Y + U, \qquad (16)$$

where

$Z$ = total error score,
$Y$ = number of items answered incorrectly or skipped, and
$U$ = number of items unanswered at the end of the test.

Then

$$E_Z^2 = S_Z^2 (1 - R_{ZZ}). \qquad (17)$$

From the usual formula for the variance of a sum,

$$S_Z^2 = S_Y^2 + S_U^2 + 2 r_{YU} S_Y S_U. \qquad (18)$$

The reliabilities of $Z$, $Y$, and $U$ are related by

$$R_{ZZ} = \frac{S_Y^2 R_{YY} + S_U^2 R_{UU} + 2 r_{YU} S_Y S_U}{S_Z^2}. \qquad (19)$$

Substituting from Equations (18) and (19) in Equation (17) and
simplifying gives

$$E_Z^2 = S_Y^2 + S_U^2 - S_Y^2 R_{YY} - S_U^2 R_{UU}. \qquad (20)$$

Using Equation (20) and the usual definition of the variance of er-
rors of measurement, we find

$$E_Z^2 = E_Y^2 + E_U^2.^{*} \qquad (21)$$

When a parallel form is used to compute the reliability of such a
test, $E_U^2$ will be different from zero. Where one uses an odd-even
reliability on the same form, $E_U^2$ is forced to be essentially zero. The
problem of obtaining a good estimate for the reliability of a par-
tially speeded power test is essentially the problem of obtaining a
good estimate of $E_U^2$.

*It is assumed that for any two parallel tests $i$ and $j$,

$$r_{Y_i U_i} = r_{Y_i U_j} = r_{Y_j U_i} = r_{Y_j U_j}.$$
¥,U; ¥,", ¥,7, 7,9, 








From the general relationship between error variance and re-
liability, we have

$$R = 1 - \frac{\text{Error variance}}{\text{Total variance}}. \qquad (22)$$

As indicated previously, when $R_{ZZ}$ is a stepped-up odd-even reliabil-
ity $E_U^2$ is forced to be essentially zero; hence from Equation (21) we
may write

$$E_Y^2 = S_Z^2 (1 - R_{ZZ}). \qquad (23)$$

Substituting from Equations (21) and (23) in (22), we have

$$R = 1 - \frac{S_Z^2 (1 - R_{ZZ}) + E_U^2}{S_Z^2}. \qquad (24)$$

Simplifying Equation (24), we obtain

$$R = R_{ZZ} - \frac{E_U^2}{S_Z^2}. \qquad (25)$$

Using Formula (25) we will have several different corrections for
the stepped-up odd-even reliability ($R_{ZZ}$), corresponding to different
values taken as an upper bound for $E_U^2$.

If the test is only slightly speeded, the total variance of the $U$
scores may be taken as a reasonable upper bound for the error vari-
ance of the $U$ scores. We may thus write

$$E_U^2 = S_U^2, \qquad (26)$$

from which we have the estimate of reliability

$$R_S = R_{ZZ} - \frac{S_U^2}{S_Z^2}. \qquad (27)$$

From the inequality (13) we may also set

$$E_U^2 = M_U, \qquad (28)$$

from which we have a second estimate for reliability,

$$R_M = R_{ZZ} - \frac{M_U}{S_Z^2}. \qquad (29)$$

If it is possible to obtain some reasonable approximation for the
“number of items” $K$ in the speeded part of the test, we may use
Equation (10) as an estimate of $\sum s_g^2$.

It should be noted that Equations (10) and (11) for $\sum s_g^2$ depend
upon Equation (5). Equation (5) will hold when the total score is
a sum of units that are scored either 1 or zero; hence Equations (10)
and (11) will apply to the $U$-scores as well as to the number cor-
rect or number incorrect scores. Deleting the last term of Equation
(10) and using Eq. (15) yields another upper bound for the error
variance of the $U$-scores,

$$E_U^2 = \frac{K M_U - M_U^2 - S_U^2}{K-1}, \qquad (30)$$

from which we may write

$$R_K = R_{ZZ} - \frac{K M_U - M_U^2 - S_U^2}{(K-1)\, S_Z^2}. \qquad (31)$$

Equations (26) to (31) give three possible corrections for $R_{ZZ}$, the
odd-even reliability of a partially speeded power test. These correc-
tions are in terms of the mean, the standard deviation, and the num-
ber of items for the $U$ score, where $U$ is the number of items unat-
tempted by each person. It should be noted that $R_K > R_M > R_S$.
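The three corrections can be written as one short routine. The following Python sketch is illustrative only; the function name, argument names, and the sample figures are assumptions and do not come from the paper's data.

```python
def corrected_reliabilities(r_zz, var_z, mean_u, var_u, k=None):
    """Apply Equations (27), (29) and, when K is supplied, (31).

    r_zz   : stepped-up odd-even reliability
    var_z  : variance of the test score
    mean_u : mean of the "number unattempted" score U
    var_u  : variance of U
    k      : assumed number of items in the speeded part (optional)
    """
    result = {
        "R_S": r_zz - var_u / var_z,    # Equation (27)
        "R_M": r_zz - mean_u / var_z,   # Equation (29)
    }
    if k is not None:
        error_bound = (k * mean_u - mean_u ** 2 - var_u) / (k - 1)  # Eq. (30)
        result["R_K"] = r_zz - error_bound / var_z                  # Eq. (31)
    return result


# Hypothetical figures, chosen only to show the arithmetic.
example = corrected_reliabilities(r_zz=0.90, var_z=50.0,
                                  mean_u=2.0, var_u=4.0, k=10)
print(example)
```

With these made-up figures the corrections are about .82, .86, and .873 for $R_S$, $R_M$, and $R_K$ respectively.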





3. Application of Formulas for Reliability of a Partially 
Speeded Power Test 


Table 1 gives data from 12 pairs of tests showing the relation-
ship between the corrected odd-even reliability, $R_{ZZ}$, the correlation
of “parallel forms,” $R_{12}$, and the corrections, $R_S$ and $R_M$, given by
Formulas (27) and (29).

Tests A (WAL-1, WAL-3)* are Synonyms and Sentence Com- 
pletion tests. Tests B (WAL-2, WAL-4), D (WLSX4-5-1, WLSX4-5- 
2), and E (WLSX2-4-1, WLSX2-4-2) are Paragraph Reading tests. 
Tests C (WLSX1-1-1, WLSX1-1-2) are Analogies tests. Tests F 
(WLSX4-2-1, WLSX4-2-2) and G (WLSX3-2-1, WLSX3-2-2) are 
paragraphs with Arguments. Tests H (WCM1-5, WCM1-6), I 
(VCM1-3, VCM1-4), J (VSA1-3, VSA1-4), and K (WCM1-3, 
WCM1-4) are Mathematics tests. Tests L (WLSX4-3-1, WLSX4-3-2) 
consist of paragraphs for each of which the student is required to 
choose the main idea. 

From inspection of Table 1 it may be seen that where $S_U^2$ is
less than three-tenths of the variance of the test, $R_S$ is not unreason-
able. Where the test is more speeded than this, $S_U^2$ gives an unsatis-
factorily large estimate for the error variance of the “speed compo-
nent.” In some of these cases $M_U$ seems to be a satisfactory estimate
of the error variance.


*These code letters serve to identify the test form and population used.


TABLE 1
Comparison of Corrected Odd-Even Reliabilities, Parallel-Forms Reliabilities,
and “Estimates” $R_S$ and $R_M$ in Twelve Pairs of Tests

[Table body not legible in the source scan.]

Table notes: number of items; testing time in minutes. The author wishes to
thank Mr. Warren Torgerson for preparing the material in this table.

It should be noted that test pairs B, D, E, F, and G are reading 
tests consisting of several passages with several items on each pas- 
sage. The odd-even reliability was computed on odd-even items, while 
the correlation between tests represents agreement between passages. 
Thus it seems reasonable to assume that $R_{ZZ}$ and $R_{12}$ represent
different aspects of reliability. Hence the “lower bound” given by $R_M$ is
greater than $R_{12}$. It seems clear in such cases that if one wishes to
estimate a parallel-forms reliability, odd and even paragraphs should 
be used instead of odd and even items. It should also be noted that 
D1 and D2 are not parallel forms so far as the speed segment is con- 
cerned. The U-score for Test D1 has a slightly larger mean and a con- 
siderably larger standard deviation than does D2. It seems reasonable 
to suppose that the students who saw how little they had finished on 
D1 speeded up their performance on D2 and hence lowered the stand- 
ard deviation attributable to unfinished items at the end of the test. 
A similar situation occurs in Tests E1 and E2 as well as in L1 and L2.
Whatever the reason, it seems clear that we cannot regard these three 
pairs as parallel tests for both the power and speed aspects. This 
factor also may cause the odd-even correlation to be higher than the 
correlation between parallel forms. 

We see also that for test pairs C, H, and I, the parallel-form cor- 
relation is higher than the stepped-up odd-even correlation. One pos- 
sible explanation would be that the tests were composed of heter- 
ogeneous items carefully matched between tests and also steeply grad- 
ed in difficulty. 

From the foregoing observations, it seems that an experiment 
which is designed to test agreement between parallel-forms and split- 
half estimates of reliabilities must use carefully matched tests and 
equally carefully matched halves of tests. For example, the items 
could be set up in quadruples matched for content, difficulty, and 
item-test correlation; these would then be assigned randomly, one 
to each of the four “parallel” half tests. 

If the tests are to be speeded, the items in each of the quadruples 
previously mentioned must also be matched for “time to solve.” Pre- 
cautions must be taken to see that the subjects settle down to a rela- 
tively constant pace. For example, it may be necessary to use four 
matched tests in the hope that the last two will show relatively simi- 
lar speed characteristics. The “‘parallelness” of the resulting sets of 
test scores can then be tested by the methods given by Wilks (4). 

It is suggested that when reliability is estimated from a single
form of a test, $M_U$ always be calculated and reported along with the
test score variance. If the ratio $M_U/S_Z^2$ is greater than some value
such as .2 or .3, the test is probably so speeded that a satisfactory
reliability can be obtained only by use of a parallel form.


4. Computing $M_U$ and $S_U^2$ from Item Analysis Data


It is possible to obtain the mean and variance of the $U$ scores
from item analysis data which give for each item the number of per-
sons answering correctly, the number answering incorrectly, and
the number not reaching the item.*

*If appropriate item analysis data are available, this method is more rapid
than re-scoring answer sheets.

Let us use $K$ to designate the number of items in the test and $w_g$
to designate the number of persons who did not reach Item $g$. Then
$w_{g+1} \ge w_g$, since all persons who did not reach Item $g$ did not reach
any subsequent item. We may write

$$\sum_{i=1}^{N} U_i = \sum_{g=1}^{K} w_g; \qquad (32)$$

i.e., the sum of all the “unattempted” scores may be obtained by sum-
ming first over persons and next over items or vice versa. Therefore

$$M_U = \frac{\sum_{i=1}^{N} U_i}{N} = \frac{\sum_{g=1}^{K} w_g}{N}. \qquad (33)$$

In order to obtain the standard deviation of the “number unat-
tempted” score, we will use the usual formula for standard deviation,

$$S_U^2 = \frac{\sum_{i=1}^{N} U_i^2}{N} - M_U^2. \qquad (34)$$

Since $M_U$ is given by (33), we know all the terms in this equation ex-
cept $\sum U^2$. Let us use $n_u$ to indicate the number of persons making
an “unattempted” score of $u$; then











$$
\begin{aligned}
n_1 &= w_K - w_{K-1} \\
n_2 &= w_{K-1} - w_{K-2} \\
n_3 &= w_{K-2} - w_{K-3} \\
&\;\;\vdots \\
n_u &= w_{K-u+1} - w_{K-u} \\
&\;\;\vdots \\
n_{K-2} &= w_3 - w_2 \\
n_{K-1} &= w_2 - w_1 \\
n_K &= w_1
\end{aligned}
\qquad (35)
$$

Many of the terms in (35) will be zero, since all the persons will pre-
sumably attempt many of the earlier items in the test.

In order to obtain $\sum U^2$, it is necessary to multiply the first fre-
quency by $1^2$, the second by $2^2$, and so on. The sum of the resulting
products is $\sum U^2$. Using Equation (35) to write this sum of products,
collecting similar terms, and simplifying, we have

$$
\begin{aligned}
\sum U^2 = {} & 1\,w_K + 3\,w_{K-1} + 5\,w_{K-2} + \cdots \\
& + (2u-1)\,w_{K-u+1} + (2u+1)\,w_{K-u} + \cdots \\
& + (2K-5)\,w_3 + (2K-3)\,w_2 + (2K-1)\,w_1.
\end{aligned}
\qquad (36)
$$

The sum of this series may be written

$$\sum_{i=1}^{N} U_i^2 = 2 \sum_{u=0}^{K} u\, w_{K-u} + \sum_{u=0}^{K} w_{K-u}. \qquad (37)$$

The summation begins with $u = 0$ because the first term is $w_K$; i.e.,
$w_{K-u}$ where $u = 0$. For the sake of completeness, the summation is
indicated as extending to $K$, but in any computational problem many
of the terms will be zero and can be omitted. Substituting from (33)
and (37) in (34), we have the solution

$$S_U^2 = \frac{2}{N} \sum_{u=0}^{K} u\, w_{K-u} + M_U - M_U^2, \qquad (38)$$

where

$S_U$ is the standard deviation of the “number unattempted” score,
$w_{K-u}$ is the number of persons not reaching the $(K-u)$th item,
$N$ is the number of persons, and
$M_U$ is given by Equation (33).








By using Formulas (33) and (38), $M_U$ and $S_U$ may be calculated
directly from item analysis data which show the number of persons
not reaching each item. These formulas will enable one to avoid the
labor of rescoring the answer sheets in order to obtain $M_U$ or $S_U$.
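Formulas (33) and (38) can be checked against a direct computation on the individual $U$ scores. The Python sketch below is an illustration under assumed data (six persons, five items); the function names are hypothetical and the scores are invented.

```python
from statistics import mean, pvariance


def not_reached_counts(u_scores, k):
    """Build w_1..w_K: w_g counts persons with U >= K - g + 1."""
    return [sum(1 for u in u_scores if u >= k - g + 1) for g in range(1, k + 1)]


def unattempted_moments(not_reached, n_persons):
    """Return (M_U, S_U^2) from the counts w_g, per Equations (33) and (38).

    not_reached[g - 1] holds w_g for g = 1..K.
    """
    k = len(not_reached)
    mean_u = sum(not_reached) / n_persons                     # Equation (33)
    # Sum of u * w_{K-u}; the u = K term involves w_0 = 0 and is omitted.
    weighted = sum(u * not_reached[k - u - 1] for u in range(k))
    var_u = 2.0 * weighted / n_persons + mean_u - mean_u ** 2  # Equation (38)
    return mean_u, var_u


# Invented example: U scores for N = 6 persons on a K = 5 item test.
u_scores = [0, 0, 1, 2, 2, 3]
w = not_reached_counts(u_scores, k=5)          # [0, 0, 1, 3, 4]
m_u, s2_u = unattempted_moments(w, len(u_scores))
print(w, m_u, s2_u)
print(mean(u_scores), pvariance(u_scores))     # direct computation for comparison
```

Both routes give a mean of 8/6 and a variance of 11/9 for this example, so the item-analysis counts alone are sufficient.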


REFERENCES 

1. Guttman, Louis. A basis for analyzing test-retest reliability. Psychometrika,
1945, 10, 255-282.

2. Jackson, R. W. B., and Ferguson, G. A. Studies on the Reliability of Tests.
Bulletin No. 12, Department of Educational Research, University of Toronto,
Toronto, Canada. 1941. Pp. 182.

3. Kuder, G. F., and Richardson, M. W. The theory of the estimation of test
reliability. Psychometrika, 1937, 2, 151-160.

4. Wilks, S. S. Sample criteria for testing equality of means, equality of vari-
ances, and equality of covariances in a normal multivariate distribution.
Annals of math. Stat., 1946, 17, 257-281.


Manuscript received 10/26/49 








PSYCHOMETRIKA—VOL. 15, NO. 3 
SEPTEMBER, 1950 


THE DISCRIMINATION OF TWO RACIAL SAMPLES 


PAUL HORST AND STEVENSON SMITH 
UNIVERSITY OF WASHINGTON 


Nineteen different anthropometric measures were obtained on 
all members of each of two racial groups. A procedure was devel- 
oped and applied to the data to give maximum differentiation be- 
tween the groups. The method is applicable wherever we have a 
large number of independent variables and a dependent variable. In 
such cases, the conventional methods for determining multiple re- 
gression constants are very laborious. An iteration method is pre- 
sented which is more rapid than any with which the writers are 
familiar. The method selects in sequence those variables which to- 
gether yield the largest multiple correlation with the criterion. At 
each step in the procedure, rapid estimates of the regression weights 
and the multiple correlation at that point are available. 


1. Introduction 

This paper deals with differences in patterns of body dimensions 
between men of Japanese ancestry and men of mixed Caucasian an- 
cestry. It is well recognized that the measurement of a single dimen- 
sion serves but poorly to differentiate samples of any two groups of 
diverse geographic origin because of the considerable overlapping of 
the two score distributions. But we obtain increased dimorphism of 
the combined distribution when we summate for each individual a 
number of trait scores which are weighted according to the signifi- 
cance of the group differences. The statistical technique here used 
was devised by Paul Horst, who also supervised the computations. 
The measurements were made by Stevenson Smith.* 

The next section of this paper contains descriptions of the sub- 
jects, the measurements, and the procedures used in making the meas- 
urements. The computations involved in obtaining the discrimina- 
tion index are described in detail in Section 3. A final section pre- 
sents the mathematical derivation of the procedure. 


2. Subjects and Measurements 


The subjects used in this experiment were fifty U. S. Army en-
listed men of pure Japanese ancestry who were born in the Hawaiian
Islands and fifty Caucasian enlisted men, twenty-five of whom were
Island residents and twenty-five of whom were residents of the main-
land. The fifty men of Japanese ancestry and the twenty-five Island
Caucasians were measured in the separation center at Schofield Bar-
racks. During the period of the measurements every Caucasian, and
every man of pure Japanese ancestry to the number of fifty, who
passed through the separation center was measured. It is hoped that
in this way any significant sampling error was avoided. It was in-
tended that fifty Caucasians from the Schofield separation center
would be measured, but when the incidence fell to two Caucasians
a week it became necessary to complete the Caucasian measurements
on the mainland. The twenty-five mainland Caucasians were Navy
enlisted personnel who were measured in the Seattle separation cen-
ter. The two racial samples showed no significant difference in age.
All measurements were made between June 1 and September 1, 1947.

*The labor cost of making the computations was met by the Anderson Re-
search Fund of the University of Washington. A grant-in-aid from the Ameri-
can Philosophical Society was employed in gathering the data.

The choice of the bodily parts that were measured was dictated 
almost entirely by the presence of stable and easily found landmarks. 
These parts were usually not skeletal units. Care was taken to avoid 
dimensions that would be influenced by different degrees of fat or 
muscle. The men were nude and for measurements 1 to 14 lay su- 
pine upon a wooden topped table covered with three thicknesses of 
Army blankets over which was drawn a sheet of paper such as is 
used in pediatric examinations. Ink dots were made on the skin 
over the skeletal landmarks and checked for accuracy in measures 
1 to 11. The men lay with legs straight, with heels together against 
the foot-board, with no pillow under the head, and, except in tak- 
ing arm measurements (measures 10 and 11), at which time the left 
lower arm was flexed at ninety degrees and laid across the abdomen, 
the arms were held at the sides parallel to the long axis of the body. 
Measures 1, 2, and 4 were made on the right side, and measures 9, 10, 
11, and 14 on the left side of the body. This was dictated by conveni- 
ence. Measures 15, 16, 17, and 18, were made with the men sitting on 
the table, their legs dangling over the side. The reference points be- 
tween which the measures were made are as follows. 


1. Sole of the foot to edge of medial malleolus (right leg).

2. Thence to medial intercondyloid fossa of knee (right leg).

3. Thence to symphysis pubis.

4. Thence to iliac crest (right).

5. Thence to substernal notch above xiphoid process.

6. Thence to suprasternal (jugular) notch.

7. Thence to vertex.

8. Distance between iliac crests.


9. Suprasternal notch to tip of acromion of left scapula. 

10. Acromion to olecranon (upper arm parallel to long axis, left 
elbow flexed at 90 degrees). 

11. Olecranon to the distal edge of the head of the left ulna. 

12. Distance between the medial commissure of right eye and of 
left eye. 

13. Distance spanning the four upper incisors at point of great- 
est width. 

14. Left ear from the lowest point of the curved floor formed 
by the antihelix and the antitragus to the farthest outside 
margin of the helix. 

15. Greatest width between the zygomatic prominences of the 
malar bones. 

16. Greatest width between temporal protuberances above ears. 

17. Supraorbital ridge to farthest occipital point. 

18. Point of chin (mental protuberance) to farthest point on 
cranium. 

19. Stature. 


Measurements 1 to 7 were taken with the horizontal stadio- 
meter, so that these are distances between levels or planes to which 
the long axis of the body is perpendicular. These planes are deter- 
mined by skeletal landmarks noted above, but the actual distances 
between the landmarks as measured by calipers would not neces- 
sarily be the same as the distances between these levels as here re- 
corded. Measurements 8 to 11 and 15 to 18 were made with slid- 
ing wooden calipers scaled to one-tenth inch. Measurements 12 to 
14 were made with steel calipers ground to sharp edges that elimi- 
nated the “inside” contact surfaces. These calipers (Lufkin Rule 
Company, No. 455) were scaled 32 to the inch. The measures were 
later converted to hundredths of an inch. 

The horizontal stadiometer was constructed by George Price, 
instrument maker in the psychology shop of the University of Wash- 
ington. It consists of a two inch chromium-plated tube scaled to 
inches, on which slides a cylindrical sleeve with a horizontal one- 
inch slit above a scale graduated to tenths of an inch from right to 
left. Through this slit are visible the numerals on the tube that is 
graduated to inches. Thus the position of the sleeve may be read 
accurately to a tenth of an inch. The sliding sleeve supports a tri- 
angular plastic pointer that is jointed in the middle so that the 
point of the triangle may be approximated to any ink-dot on the skin 
surface, the longitudinal position of which is then read off on the
scale. As the surface of the foot-board is at zero, the various meas-
ures are distances on the long axis of the body from the plantar sur-
face of the feet. The graduated tube clamps to the table edge and
holds the foot board in rigid position.


FIGURE 1
Distributions of 18 Anthropometric Measures for Japanese (J) and Cau-
casian (C). The position of the mean for each distribution is indicated
by an arrowhead.


3. Statistical Procedure


It was desired to get a composite index based on the various 
skeletal measures which would yield the maximum differentiation 
between the two groups. The following two conditions were imposed 
on the index: 


1. It should, for the sake of convenience, be a linear combi- 
nation of the skeletal measures. 


2. The critical ratio of the difference between the means of 
the indices of the two racial groups should be a maximum; that is, 
the ratio of the difference to the standard error of the difference 
should be a maximum. 


The mathematical function which satisfies the second condition
has been called the discriminant function. However, Wherry (4)
has shown that the weights given by the discriminant function are
proportional to the conventional regression weights where the de-
pendent variable is dichotomous, all the members of one group being
assigned a value of zero and those of the other group a value of unity.
The conventional multiple regression procedure was therefore ap-
plied to the data with certain modifications to be described later. To
reduce the computational work, the significance of the difference be-
tween the two groups for each variable was first tested. Those vari-
ables not significant at the 1% level were not considered in the fur-
ther analysis. Figure 1 shows the distributions of the two groups
on Measures 2 through 19. (Measure 1 was discarded because of ob-
vious non-significance.)


TABLE 1
Significance of Mean Differences

Measurement           Means              t
              Caucasians   Japanese
    (2)          .2102       .2012      5.7*
    (3)          .2401       .2312      5.1*
    (4)          .0637       .0615      1.6
    (5)          .1660       .1745      3.6*
    (6)          .0893       .0913      1.6
    (7)          .1839       .1929      6.6*
    (8)          .1367       .1445      4.8*
    (9)          .1095       .1121      2.4
    (10)         .2139       .2085      4.4*
    (11)         .1591       .1566      2.4
    (12)        1.199       1.362       7.4*
    (13)        1.167       1.169       0.1
    (14)        1.903       1.886       0.7
    (15)        5.494       5.796       6.0*
    (16)        6.098       6.346       5.9*
    (17)        7.716       7.430       5.1*
    (18)        9.922       9.878       0.8
    (19)       68.64       64.98        8.3*

*t significant at 1% level.


TABLE 2
Intercorrelations of Dependent Variables

[Table body not legible in the source scan. The entries used in the text
are the correlations of Measures (2) and (3) with Measure (19), .609 and
.425, and the sum of the Measure (19) column, .869.]


Eleven variables showed differences significant at the 1% level. 
These are indicated in Table 1. For these eleven variables product- 
moment inter-correlations were calculated and are presented in Table 
2. Point biserial validity coefficients were calculated for each of the 
eleven variables using criterion values of 0 for Japanese and 1 for 
Caucasians. 
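A point biserial coefficient is simply the product-moment correlation between a measure and a criterion scored 0 or 1, and it is tied to the $t$ ratio for the difference between two group means by $t = r\sqrt{N-2}/\sqrt{1-r^2}$. The Python sketch below illustrates both relations; the function names and the four-case data set are assumptions for illustration and are not the authors' data.

```python
import math


def point_biserial(values, groups):
    """Pearson correlation between a measure and a 0/1 group criterion."""
    n = len(values)
    mean_v = sum(values) / n
    mean_g = sum(groups) / n
    cov = sum((v - mean_v) * (g - mean_g) for v, g in zip(values, groups))
    ss_v = sum((v - mean_v) ** 2 for v in values)
    ss_g = sum((g - mean_g) ** 2 for g in groups)
    return cov / math.sqrt(ss_v * ss_g)


def t_from_r(r, n_cases):
    """t ratio for a two-group mean difference implied by a point biserial r."""
    return r * math.sqrt(n_cases - 2) / math.sqrt(1.0 - r * r)


# Tiny invented data set: two cases in each group.
r_demo = point_biserial([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print(r_demo)

# A validity of .642 with N = 100 corresponds to a t ratio of about 8.3.
print(t_from_r(0.642, 100))
```

The second figure is in line with the $t$ listed for Measure (19) in Table 1, whose validity coefficient is given as .642 in the worksheet discussion.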


To determine the weights, it would have been possible to use the 
conventional Doolittle method. This would have been somewhat la- 
borious. One of the authors therefore developed a method which, by 
a routine of successive approximations, achieves the following ends: 

1. It selects only those variables which make independent con- 
tributions toward group differentiation. 

2. The optimal weights to be assigned to the selected variables 
are determined as a part of the selection process. 

It is believed that the method has several advantages over other 
methods, for example, the Wherry-Doolittle Method (3, 245 ff.), in 
the determination of optimal weight and the selection of predictive 
variables. Among these are: (1) It is faster than any other method 
with which we are familiar. (2) A good approximation to the mul- 
tiple correlation is available at each step with very little additional 
computation. The method requires only a single worksheet in addi- 
tion to the table of intercorrelations. See Table 3. 

The first column headed ir, is simply the validity coefficients. 
At the end of the column in Row 12 is —.515, the sum of the values 
above. The highest value, .642 in Row 11, is enclosed in parenthe- 
ses. Measure (19) is the first one selected and its tentative weight 
is .642. Its square, .412164, is entered in Row 13. Calling this 
value R,?, we calculate 

,_NR?—n 
i N—n 





278 PSYCHOMETRIKA 








TABLE 3
Worksheet

Row  Meas.      1r_y      2r_y      3r_y      4r_y      5r_y      6r_y      7r_y      8r_y
  1   (2)       .494      .103      .012     -.086     -.159     -.046     -.078     -.075
  2   (3)       .455      .182      .117      .045      .002      .081      .045      .026
  3   (5)      -.345     -.141     -.081      .024      .024     -.035      .016      .025
  4   (7)      -.556     -.245     -.188   (-.136)      .000     -.090     -.051     -.038
  5   (8)      -.484     -.265     -.116     -.124     -.108     -.157   (-.115)      .000
  6  (10)       .404      .240    (.196)      .000     -.036      .011     -.007     -.002
  7  (12)      -.600     -.320     -.126     -.093     -.055     -.136     -.112     -.071
  8  (15)      -.517   (-.409)      .000      .021      .040      .009      .019      .061
  9  (16)      -.515     -.381     -.095     -.050     -.018     -.056     -.054     -.017
 10  (17)       .457      .163      .138      .117      .086    (.171)      .000     -.028
 11  (19)     (.642)      .000     -.069     -.119   (-.185)      .000     -.078   (-.108)
 12  Total     -.515    -1.073     -.212     -.401     -.409     -.248     -.415     -.227
 13  R^2     .412164   .579445   .617861   .636357   .670582   .699823   .713048   .724712
 14  R_c^2   .406226   .570862   .606042   .621101   .656856   .684024   .694732   .707140

Row  Meas.      9r_y     10r_y     11r_y     12r_y     13r_y     14r_y     15r_y     16r_y
  1   (2)      -.009     -.050     -.028     -.066     -.025     -.038     -.048     -.037
  2   (3)       .072      .039      .055      .032      .061    (.047)      .000      .008
  3   (5)      -.009      .017      .003      .003     -.019     -.007      .014      .007
  4   (7)      -.090     -.057   (-.071)      .000     -.033     -.019     -.004     -.011
  5   (8)      -.029      .013     -.023     -.015     -.033     -.013     -.005     -.023
  6  (10)       .026      .006      .017     -.002      .015      .001     -.016     -.011
  7  (12)    (-.118)      .000     -.047     -.027     -.057     -.034     -.021   (-.044)
  8  (15)       .043    (.099)      .000      .010     -.001      .042    (.049)      .000
  9  (16)      -.039      .005     -.064     -.047   (-.061)      .000      .011     -.023
 10  (17)       .021      .004      .010     -.006      .025      .024      .014      .017
 11  (19)       .000     -.051     -.034   (-.068)      .000     -.013     -.033     -.025
 12  Total     -.132      .025     -.182     -.186     -.128     -.010     -.039     -.142
 13  R^2     .738636   .748437   .753478   .758102   .761823   .764032   .766433   .768369
 14  R_c^2   .718963   .729502   .734923   .739895   .741112   .740695   .743333   .745460

Row  Meas.     17r_y     18r_y     19r_y       Wt.     sigma   Wt./sigma
  1   (2)    (-.052)      .000     -.015     -.052    .00907     -5.733
  2   (3)      -.004      .007     -.002      .047    .00979      4.808
  3   (5)       .017     -.004     -.004
  4   (7)       .001   (-.027)      .000     -.234    .00809    -28.925
  5   (8)      -.007     -.006     -.003     -.115    .00895    -12.849
  6  (10)      -.018      .008      .001      .196    .00666     29.429
  7  (12)       .000     -.018     -.010     -.162     .1361     -1.190
  8  (15)       .021      .009      .013     -.261      .292     -0.894
  9  (16)      -.007     -.018     -.012     -.061      .241     -0.253
 10  (17)       .011      .021      .015      .171      .313      0.546
 11  (19)      -.044     -.012   (-.025)      .256      2.85      0.090
 12  Total     -.082     -.040     -.042
 13  R^2     .771073   .771802   .772427
 14  R_c^2   .745637   .746447   .747141





where

N is the number of cases,
n the number of variables selected.

N is 100 and the number of variables selected is 1, so we have

$$R_c^2 = \frac{(.412164) \times 100 - 1}{100 - 1} = .406226.$$

This value is entered in Row 14.

We are now ready to calculate the second column of the work
sheet, 2r_y. Since the selected value in the first column comes from
Row 11, we multiply each entry in Column 11 of the correlation table
by it and subtract the product from the corresponding element in
Column 1 of the worksheet. The remainder we put in the second
column (2r_y). Thus,

.494 − .642 × .609 = .103,
.455 − .642 × .425 = .182,
etc.

We also include the sum row in these computations so that

−.515 − .642 × .869 = −1.073.

Now we add the first eleven entries in 2r_y to see that they actually
total to −1.073. It is important that this check be carried out. The
speed of this method is due in part to the fact that it does not auto- 
matically correct errors as do some other iterative methods. 
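
The column update and its check can be sketched in a few lines of Python. This is only an illustration of the arithmetic above: the function name `next_column` is hypothetical, and the only data used are the two worksheet entries, the sum row, and the three correlation-table values quoted in the text.

```python
def next_column(column, selected_value, corr_column):
    """Subtract selected_value times each correlation from the current column.

    `column` and `corr_column` are parallel lists. The sum row can be carried
    as the last element of each and is updated in exactly the same way.
    """
    return [c - selected_value * r for c, r in zip(column, corr_column)]


# First two entries of Column 1 plus the sum row, with the matching entries
# of Column 11 of the correlation table as quoted in the text.
col_1 = [0.494, 0.455, -0.515]
corr_11 = [0.609, 0.425, 0.869]

col_2 = [round(v, 3) for v in next_column(col_1, 0.642, corr_11)]
print(col_2)  # [0.103, 0.182, -1.073]
```

Carrying the sum row through the same update is what makes the check possible: the updated sum must equal the total of the updated entries.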

We next select the largest absolute value in 2r_y and enclose it
in parentheses. This is −.409 in Row 8, and its tentative weight is
−.409.

We add the square of this value to the preceding value in Row
13 and enter the sum in the 2r_y column. Thus,

.412164 + (−.409)² = .579445.

We calculate $R_c^2$. Since this is the second variable selected, n is now 2, so that

$$R_c^2 = \frac{100 \times (.579445) - 2}{100 - 2} = .570862.$$





This value is entered in Row 14. Row 14 represents the appli- 
cation of the Wherry shrinkage formula to multiple correlation co- 
efficients and serves to indicate when enough variables have been 
selected. 
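
As a small sketch, the Row 14 correction can be written as a one-line function. The name `shrunken_r2` and its argument names are illustrative assumptions; the formula is the one used in the worked example, (N R² − n) / (N − n).

```python
def shrunken_r2(r2, n_cases, n_vars):
    """Corrected squared multiple correlation as used in Row 14."""
    return (n_cases * r2 - n_vars) / (n_cases - n_vars)


# The first two Row 14 entries from the worked example (N = 100).
first = round(shrunken_r2(0.412164, 100, 1), 6)
second = round(shrunken_r2(0.579445, 100, 2), 6)
print(first, second)  # 0.406226 0.570862
```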

Column 3r_y of the worksheet is obtained from Column 2r_y, the
selected value in it, and Column 8 of the correlation table. Thus,

.103 − (−.409) × (−.222) = .012,
.182 − (−.409) × (−.159) = .117,
−1.073 − (−.409) × (2.107) = −.212.

The entries above the check should total to −.212. Then we find the
highest absolute value, .196 in Row 6. Variable 6 is the third one
selected and its tentative weight is .196. The square of this is added
to $R_2^2$, and we get

.579445 + (.196)² = .617861.

This sum is entered in Row 13. $R_c^2$ is calculated

$$R_c^2 = \frac{N R_3^2 - n}{N - n} = \frac{100 \times (.617861) - 3}{100 - 3} = .606042$$

and entered in Row 14.


In the same way Variable 4 is selected and $R_4^2$ and $R_c^2$ are
calculated. When we come to Column 5r_y, however, we see that the
largest entry is −.185 for Variable 11, which has already been selected.
This means that the tentative weight of .642 for Variable 11 must
be reduced by this amount due to the selection of additional variables.
We do not, however, make the adjustment at this time.

We calculate $R_5^2$ in the usual manner.

$$R_5^2 = .636357 + (.185)^2 = .670582.$$

To calculate $R_c^2$, however, we do not increase n, the number of
variables selected, since Variable 11 is not a new variable.

Since only 4 variables have been selected,

$$R_c^2 = \frac{.670582 \times 100 - 4}{100 - 4} = .656856.$$

The rule, therefore, is that in calculating any $R_c^2$ the n-value to be
used shall be only the number of variables already selected, rather 
than the number of the column in the work sheet. 
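
The rule about n can be illustrated with a short sketch that cumulates Rows 13 and 14 over the first five cycles of the example. The function name `cumulate` is hypothetical; the (row, selected value) pairs are the parenthesized entries described in the text, with N = 100.

```python
def cumulate(cycles, n_cases):
    """Return (R^2, corrected R^2) after each cycle.

    `cycles` is a list of (worksheet_row, selected_value) pairs. n counts
    distinct rows only, so re-selecting a row leaves n unchanged.
    """
    seen = set()
    r2 = 0.0
    out = []
    for row, b in cycles:
        seen.add(row)                 # a repeated row does not enlarge the set
        r2 += b * b                   # Row 13: running sum of squares
        n = len(seen)
        corrected = (n_cases * r2 - n) / (n_cases - n)   # Row 14
        out.append((round(r2, 6), round(corrected, 6)))
    return out


cycles = [(11, 0.642), (8, -0.409), (6, 0.196), (4, -0.136), (11, -0.185)]
rows = cumulate(cycles, n_cases=100)
for r2, corrected in rows:
    print(r2, corrected)
```

In the fifth cycle Row 11 recurs, so n stays at 4 and the corrected value is .656856, as in the text. The fourth corrected value computed this way is .621205, which differs slightly from the worksheet's printed entry of .621101.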

The routine may normally proceed until the last calculated value
in the $R_c^2$ row is less than the preceding value. $R_c^2$ is .741112
in Column 13 and .740695 in Column 14. Therefore, we should normally
conclude the iterations with the selection of Variable 9. As indicated
by the worksheet, the computations in this study were continued until
all but the third variable were selected. It may be noted that
$R_{13}^2$ is .761823, whereas the subsequent selection of Variables 1
and 2 brings $R_{19}^2$ to only .772427. This constitutes an increase of
only one point in the second decimal, so that with negligible loss in
discrimination we could have used only 8 of the 11 variables instead
of 10.

The weight to be assigned to each of the 10 variables is simply 
the algebraic sum of the values in parentheses in its correspond- 
ing row of the worksheet. These weights are given in the “Wt.” 
column at the right, immediately following Column 19. We see, for
example, that the weight of Variable 11 is

.642 − .185 − .108 − .068 − .025 = .256.

These weights are comparable to the conventional beta-regression
weights based on standard measures. They would be identical to the
beta-weights if the analysis had been continued until the final r_y
column contained only zeros.

In this study the various skeletal measures were not actually
reduced to standard measures with zero means and unit standard
deviations. It seemed of interest actually to calculate an index for
each person in the study so that the distributions of indices of the
two racial groups could be compared. Therefore, it was necessary to
divide the beta-weights by the corresponding standard deviations to
get weights to be applied to the arbitrary unit measures. These
weights are given in the final column of the worksheet. They were
then applied to the 10 selected skeletal measures to get an index for
each member of each racial group. Distributions of the indices for
both groups are given in Figure 2. Only 3 Japanese have an index
higher than the lowest Caucasian, and only 1 Caucasian has an
index lower than the highest Japanese.

FIGURE 2
Distributions of the Composite Index for the Japanese and the Caucasian
Groups. (Axis label: Weighted Scores; distributions labeled Caucasians
and Japanese.)


4. Mathematical Derivation 

The mathematical equations upon which the procedure is based 
were derived by one of the authors (2) some years ago for the spe- 
cial case of dichotomous variables. More recently Gengerelli (1) has 
published a method identical to the initial steps of the one here out- 
lined. 

Assume both the dependent variable and the independent variables
to be given in standard units so that all means are zero and
all standard deviations are unity. We let $y_1$ be the dependent or
criterion variable, and $(x_1, \ldots, x_n)$ the independent variables. We
begin by finding that $x_i$ which correlates highest with $y_1$, and call this
$x_1$. This we do by calculating all $r_{yi}$, that is, all validity coefficients,
and selecting the highest one. We start with the regression equation

$$y_1 - b_1 x_1 = y_2. \tag{1}$$

Since all measures are normalized, the regression coefficient $b_1$ is
simply $r_{y1}$. This is the value in parentheses in Column 1 of the
worksheet discussed in the previous section.

Now $y_2$ is the residual from $y_1$ and is independent of $x_1$. We
therefore find which of the remaining variables correlates most
highly with $y_2$. Actually, however, we are not concerned with the
correlations $r_{y_2 i}$ as such, but only with the regression coefficients
$b_{y_2 i}$. It can readily be shown that the x variable yielding the
largest regression coefficient is the one which correlates highest with $y_2$.
It is also well known that

$$b_{y_2 i} = \frac{\sum y_2 x_i}{N}, \tag{2}$$

where all x's are normalized. Hence, from (1) and (2),








$$b_{y_2 i} = r_{yi} - b_1 r_{1i}. \tag{3}$$

We therefore calculate the correlations of $x_1$ with all the other
variables and get all possible $b_{y_2 i}$ from Equation (3). This is the
second column of the worksheet.

We next select the $x_i$ having the largest $b_{y_2 i}$ and call this $x_2$. We
also designate its corresponding $b_{y_2 2}$ merely $b_2$. Then we consider
the residual equation

$$y_2 - b_2 x_2 = y_3. \tag{4}$$

As before, we look for the variable which correlates most highly with
the residual $y_3$ by finding the highest regression coefficient $b_{y_3 i}$. We
have

$$b_{y_3 i} = \frac{\sum y_3 x_i}{N}. \tag{5}$$

But from (1) and (4)

$$y_3 = y_1 - b_1 x_1 - b_2 x_2, \tag{6}$$

and from (5) and (6)

$$b_{y_3 i} = r_{yi} - b_1 r_{1i} - b_2 r_{2i}. \tag{7}$$

We have already calculated in Column 2 of the worksheet all values
$r_{yi} - b_1 r_{1i}$ and from this column selected $b_2$ and $x_2$. Therefore, we
calculate all correlations of $x_2$ with the other variables and get the
third column of the worksheet by subtracting $b_2 r_{2i}$ from the second
column. In general then, the successive b's are given by the following
equations:

$$\begin{aligned}
b_1 &= r_{y1},\\
b_2 &= r_{y2} - b_1 r_{12},\\
b_3 &= r_{y3} - b_1 r_{13} - b_2 r_{23},\\
b_4 &= r_{y4} - b_1 r_{14} - b_2 r_{24} - b_3 r_{34},
\end{aligned}$$

etc., and these equations for the b's are based on the relations:

$$\begin{aligned}
y_1 - b_1 x_1 &= y_2,\\
y_2 - b_2 x_2 &= y_3,\\
y_3 - b_3 x_3 &= y_4,
\end{aligned}$$

etc.


The successive columns of the worksheet are given by the equations

$$\begin{aligned}
r_{y_1 i} &= r_{yi},\\
r_{y_2 i} &= r_{y_1 i} - b_1 r_{1i},\\
r_{y_3 i} &= r_{y_2 i} - b_2 r_{2i},\\
r_{y_4 i} &= r_{y_3 i} - b_3 r_{3i},
\end{aligned}$$

etc.

It should be made clear, however, that the successive subscripts,
1, 2, 3, etc., do not necessarily represent distinct variables. For
example, $x_1$ and $x_5$ may be the same variable, as was actually the case
in the example presented above. Essentially the process has two
distinct characteristics. Each cycle either (1) selects a new variable
and assigns a tentative weight to it, or (2) provides a correction to
the weight of a previously determined variable. With a minor
exception to be indicated later, the computational routine is the same,
irrespective of which objective a particular cycle accomplishes.

We shall now consider a more precise interpretation of Row
13 (the $R^2$ row) in the worksheet. We consider first the residual
variance of y at each cycle, that is, $\sigma_{y_1}^2$, $\sigma_{y_2}^2$, etc. From equation (1)
we have

$$\sigma_{y_2}^2 = \frac{\sum y_2^2}{N} = \frac{\sum (y_1 - b_1 x_1)^2}{N} = \sigma_{y_1}^2 - 2 b_1 \frac{\sum y_1 x_1}{N} + b_1^2. \tag{8}$$

But

$$\frac{\sum y_1 x_1}{N} = r_{y1} = b_1;$$

therefore (8) becomes

$$\sigma_{y_2}^2 = \sigma_{y_1}^2 - b_1^2. \tag{9}$$

Similarly, from (4)

$$\sigma_{y_3}^2 = \frac{\sum y_3^2}{N} = \frac{\sum (y_2 - b_2 x_2)^2}{N} = \sigma_{y_2}^2 - 2 b_2 \frac{\sum y_2 x_2}{N} + b_2^2. \tag{10}$$

But since the largest $b_{y_2 i}$ was designated $b_2$, we have from (2)

$$\frac{\sum y_2 x_2}{N} = b_2, \tag{11}$$

and from (10) and (11)

$$\sigma_{y_3}^2 = \sigma_{y_2}^2 - b_2^2. \tag{12}$$

From (9) and (12), since $\sigma_{y_1}^2 = 1$,

$$\sigma_{y_3}^2 = 1 - b_1^2 - b_2^2. \tag{13}$$

For the successive residual variances we have, therefore,

$$\begin{aligned}
\sigma_{y_1}^2 &= 1,\\
\sigma_{y_2}^2 &= 1 - b_1^2,\\
\sigma_{y_3}^2 &= 1 - b_1^2 - b_2^2,\\
\sigma_{y_4}^2 &= 1 - b_1^2 - b_2^2 - b_3^2,
\end{aligned}$$

etc.


If we define the square of the multiple correlation as 1 minus
the residual variance, then we have for the successive estimates of $R^2$

$$\begin{aligned}
R_1^2 &= b_1^2,\\
R_2^2 &= b_1^2 + b_2^2,\\
R_3^2 &= b_1^2 + b_2^2 + b_3^2,
\end{aligned}$$

etc., which are the values given in Row 13 of the work sheet.

The $R_c^2$ row of the work sheet is based on the Wherry shrinkage
formula for correcting multiple correlation coefficients for chance,
viz.,

$$R_c^2 = \frac{N R^2 - n}{N - n},$$

where N is the number of cases, n the number of variables,* R the
uncorrected multiple correlation coefficient, and $R_c$ the corrected
coefficient. It is only in the calculation of the $R_c^2$ row that one must
observe whether the cycle adds a new variable or merely adjusts the
weight of one previously selected. In the former case, n is increased
by 1 from its previous value, while in the latter case it remains
the same.

The cycles are continued until the $R_c^2$ is less than the one
immediately preceding. As a matter of fact, if it remains the same,
one may as well discontinue the computations, as the addition of
variables in general will complicate the use of the battery in later
application without improving the predictive value.

*To be theoretically accurate n should include also the dependent variable.
For the average computing clerk it is a little simpler to let n be merely the
number of rows in the worksheet in which at least one parenthesized value is
found. For values of N large enough to justify the analysis the results are
practically the same.

The final regression weight for a variable is obviously the sim- 
ple algebraic sum of the b values corresponding to it. 
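
The whole routine can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the authors' worksheet procedure in full: the function name `successive_residuals` is hypothetical, the three-predictor correlation matrix is invented for the example, and the stopping rule based on $R_c^2$ is omitted so that the cycles simply run for a fixed number of steps.

```python
def successive_residuals(r, r_y, max_cycles):
    """Approximate regression weights by successive residuals.

    r   : predictor intercorrelation matrix (list of lists, unit diagonal)
    r_y : validity coefficients (correlation of each predictor with y)
    Returns (weights, r2, residuals); r2 is the running sum of squared b's.
    """
    resid = list(r_y)
    weights = [0.0] * len(r_y)
    r2 = 0.0
    for _ in range(max_cycles):
        # Select the largest absolute value in the current column.
        k = max(range(len(resid)), key=lambda i: abs(resid[i]))
        b = resid[k]
        if b == 0.0:
            break
        weights[k] += b          # final weight = algebraic sum of the b's
        r2 += b * b              # Row 13: cumulated squares
        # Next column: subtract b times column k of the correlation table.
        resid = [resid[i] - b * r[i][k] for i in range(len(resid))]
    return weights, r2, resid


# Hypothetical data, for illustration only (not taken from the study).
r = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
r_y = [0.6, 0.5, 0.4]

weights, r2, resid = successive_residuals(r, r_y, max_cycles=200)
fitted = [sum(r[i][j] * weights[j] for j in range(3)) for i in range(3)]
print([round(w, 4) for w in weights], round(r2, 4))
```

When the residual column has shrunk to zero, the weights satisfy the normal equations (the correlation matrix times the weights reproduces the validity coefficients), which is the sense in which they become the beta-weights.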

The question may well be raised whether the method as out- 
lined gives a sufficiently accurate estimate at any cycle of (1) the 
true multiple correlation which an exact solution would yield, and 
(2) the correlation of the weighted sum of variables with the cri- 
terion. The first question may be answered by observing that at any 
point one may ignore new variables and utilize further cycles simply 
for correcting previously determined weights. So far as the $R^2$ is
concerned, this consists of the cumulation of squares of rapidly
decreasing quantities, so that in most cases the $R^2$ is not affected by
more than 1 in the second decimal. The actual R would be influenced
even less.

As for the second question, our procedure assumes that

$$R_i^2 = r^2_{y(\beta_1 x_1 + \cdots + \beta_l x_l)} = (b_1^2 + b_2^2 + \cdots + b_i^2), \tag{14}$$

where each $\beta$ is not, strictly speaking, a regression coefficient but
consists of the sum of all b's involving the variable corresponding
to that particular $\beta$; where $l$ is the number of variables selected; and
where $i$ is the number of cycles. If we let

$$\rho = r_{y(\beta_1 x_1 + \cdots + \beta_l x_l)}, \tag{15}$$

it can be shown that

$$\rho^2 = \frac{\left(\sum_{i=1}^{l} \beta_i r_{yi}\right)^2}{\sum_i \sum_j \beta_i \beta_j r_{ij}}, \tag{16}$$

where i and j take all values from 1 to l.
Let us further examine the composition of the right side of
(16). We let

$B_i$ = a vector of the $\beta$'s at cycle i,
$r$ = the matrix of intercorrelations,
$r_y$ = the vector of validity coefficients, and
$E_i$ = the residual vector of validity coefficients at cycle i.


Then (16) can be rewritten

$$\rho_i^2 = \frac{(B_i' r_y)^2}{B_i' r B_i}. \tag{16a}$$

But it can readily be shown that

$$E_i = r_y - r B_i \tag{17}$$

or

$$r B_i = r_y - E_i. \tag{18}$$

Hence

$$B_i' r B_i = B_i' r_y - B_i' E_i. \tag{19}$$

We now write

$$B_{i+1} = B_i + b_{i+1} e_{i+1}, \tag{20}$$

where $b_{i+1}$ is the highest value in the $E_i$ vector and $e_{i+1}$ is a vector
with all zero elements except the (i + 1)th which is unity. Also

$$E_{i+1} = E_i - b_{i+1} r_{i+1}. \tag{21}$$

Then from (20) and (21)

$$B_{i+1}' E_{i+1} = B_i' E_i - b_{i+1} B_i' r_{i+1} + b_{i+1} E_{i(i+1)} - b_{i+1}^2 r_{(i+1)(i+1)}. \tag{22}$$

But from (18)

$$B_i' r_{i+1} = r_{y(i+1)} - E_{i(i+1)}, \tag{23}$$

since $r_{i+1}$ is the (i + 1)th column of r. Also,

$$E_{i(i+1)} = b_{i+1}, \tag{24}$$

$$r_{(i+1)(i+1)} = 1. \tag{25}$$

Substituting from (23), (24), (25) in (22), we get

$$B_{i+1}' E_{i+1} = B_i' E_i - b_{i+1} r_{y(i+1)} + b_{i+1}^2. \tag{26}$$

Also from (20)

$$B_{i+1}' r_y = B_i' r_y + b_{i+1} r_{y(i+1)}. \tag{27}$$

From (16) and (19)

$$\rho_i^2 = \frac{(B_i' r_y)^2}{B_i' r_y - B_i' E_i}. \tag{28}$$

Or, by analogy to (28),

$$\rho_{i+1}^2 = \frac{(B_{i+1}' r_y)^2}{B_{i+1}' r_y - B_{i+1}' E_{i+1}}. \tag{29}$$

















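Substituting from (26) and (27) in (29), we have

$$\rho_{i+1}^2 = \frac{(B_i' r_y + b_{i+1} r_{y(i+1)})^2}{(B_i' r_y + b_{i+1} r_{y(i+1)}) - (B_i' E_i - b_{i+1} r_{y(i+1)} + b_{i+1}^2)}. \tag{30}$$

Suppose now we begin with i = 1 so that $B_1$ is simply the scalar
$b_1$. Then $B_1' E_1 = 0$ and

$$\rho_2^2 = \frac{(b_1 r_{y1} + b_2 r_{y2})^2}{(b_1 r_{y1} + b_2 r_{y2}) + b_2 r_{y2} - b_2^2}. \tag{31}$$

Similarly,

$$\rho_3^2 = \frac{(b_1 r_{y1} + b_2 r_{y2} + b_3 r_{y3})^2}{(b_1 r_{y1} + b_2 r_{y2} + b_3 r_{y3}) + b_2 r_{y2} + b_3 r_{y3} - b_2^2 - b_3^2}, \tag{32}$$

or in general,

$$\rho_i^2 = \frac{\left(\sum b_i r_{yi}\right)^2}{\left(\sum b_i r_{yi}\right) + \left(\sum b_i r_{yi} - b_1 r_{y1}\right) - \left(\sum b_i^2 - b_1^2\right)}. \tag{33}$$

But it will be remembered that $b_1$ is simply the largest $r_{yi}$, viz.,
$r_{y1}$; hence $r_{y1} = b_1$, and (33) becomes

$$\rho_i^2 = \frac{\left(\sum b_i r_{yi}\right)^2}{2 \sum b_i r_{yi} - \sum b_i^2}. \tag{34}$$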
Actually (34) gives precisely the correlation of y with the linear
combination of the x variables weighted according to the b's at
any point in the selection-approximation procedure. Instead of using
$\sum b_i^2$ as an estimate of the square of the multiple correlation
coefficient, we could have used (34). From Equations (26) and (27)
the values for the numerator and denominator terms of the successive
$\rho_i^2$'s can be cumulated. However, the computations are considerably
more involved than for $R_i^2 = \sum b_i^2$; and with empirical tests
made to date, the results agree in general in the second decimal. The
writer therefore believes that for all practical purposes $R_i^2 = \sum b_i^2$
given in Row 13 of the work sheet may be taken as a very good
estimate of the square of the multiple correlation coefficient at each 
cycle in the computations. 
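
The agreement between (34) and the direct definition (16) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical and the three-predictor correlation matrix is invented, not taken from the study. It runs a few cycles, then computes the squared correlation both from the accumulated weights via (16) and from the cumulated sums of (34), alongside the simpler estimate given by the sum of squared b's.

```python
def run_cycles(r, r_y, n_cycles):
    """Run n_cycles of the selection routine; return weights and (index, b) steps."""
    resid = list(r_y)
    beta = [0.0] * len(r_y)
    steps = []
    for _ in range(n_cycles):
        k = max(range(len(resid)), key=lambda i: abs(resid[i]))
        b = resid[k]
        steps.append((k, b))
        beta[k] += b
        resid = [resid[i] - b * r[i][k] for i in range(len(resid))]
    return beta, steps


def rho2_direct(r, r_y, beta):
    """Equation (16): squared correlation of y with the weighted composite."""
    m = len(beta)
    num = sum(beta[i] * r_y[i] for i in range(m)) ** 2
    den = sum(beta[i] * beta[j] * r[i][j] for i in range(m) for j in range(m))
    return num / den


def rho2_eq34(r_y, steps):
    """Equation (34): the same quantity from cumulated sums over the cycles."""
    s_br = sum(b * r_y[k] for k, b in steps)
    s_bb = sum(b * b for _, b in steps)
    return s_br ** 2 / (2 * s_br - s_bb)


# Hypothetical data, for illustration only.
r = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
r_y = [0.6, 0.5, 0.4]

for n_cycles in range(1, 7):
    beta, steps = run_cycles(r, r_y, n_cycles)
    sum_b2 = sum(b * b for _, b in steps)
    print(n_cycles, round(rho2_direct(r, r_y, beta), 6),
          round(rho2_eq34(r_y, steps), 6), round(sum_b2, 6))
```

The second and third printed columns coincide at every cycle, while the last column shows how closely the cumulated sum of squares tracks them.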


REFERENCES

1. Gengerelli, J. A. A simplified method for approximating regression
   coefficients. Psychometrika, 1948, 13, 135-146.
2. Horst, Paul. Item analysis by the method of successive residuals. J. exp.
   Educ., 1934, 2, 254-263.
3. Stead, W. H., and Shartle, C. L. Occupational Counseling Techniques. New
   York: American Book Co., 1940.
4. Wherry, Robert J. Multiple bi-serial and multiple point bi-serial
   correlation. Psychometrika, 1947, 12, 189-195.














PSYCHOMETRIKA—VOL. 15, NO. 3 
SEPTEMBER, 1950 


AN EXPERIMENTAL STUDY OF THE EFFECTS ON ITEM- 
ANALYSIS DATA OF CHANGING ITEM PLACEMENT 
AND TEST TIME LIMIT 


WILLIAM G. MOLLENKOPF* 
EDUCATIONAL TESTING SERVICE 


Item-analysis data are usually obtained from a single test ad- 
ministration, with a given item sequence and time limit. Questions 
can be raised as to the effects upon item data resulting from changes 
in item-position and test-timing. In this study, two forms of a ver- 
bal test and two forms of a mathematics test were used. In each 
case, both forms of each test contained the same items, but items 
coming early in one form were placed late in the other. Each of 
these forms was administered once with a short time limit and once 
with generous timing to comparable groups of high school stu- 
dents. The relationships of various speed and power scores were 
determined, and the changes which occurred during the added time 
were studied. Values of the item indices p (proportion right), Δ†
(another difficulty index), and the item-test biserial correlation
coefficient were obtained for both the speed and the power
conditions and were systematically compared. The proportion right of
those attempting the item, the Δ index, and the biserial r were all
found to have undesirable characteristics for items appearing late
in a speeded test.


Item-analysis data are usually derived from a single administra- 
tion of a try-out form with a given sequence of items and a given 
time limit. Interest in the effects upon item-analysis data of chang- 
ing either the location of the items or the test time led to the pres- 
ent study, which involves the three item indices routinely used in the 
construction of the tests of the College Entrance Examination Board. 
These indices are the proportion of those attempting an item who
answer it correctly (p), the difficulty index Δ, and the biserial
coefficient of correlation between performance on the item and total
test score.

In the development of items for use in College Entrance Examination
Board tests (for example, the Scholastic Aptitude Test), sets
of items are tried out in experimental forms, under conditions as
similar as possible to those that will prevail when the final forms are
administered. The items are editorially reviewed and screened before
their experimental use, and are arranged in rough order of difficulty
insofar as the subjective estimates of the item writers permit
this to be accomplished. In order to secure as much data as possible
from the experimental try-out and in order to approximate conditions
that obtain in the final forms of the tests, the number of items
included in the experimental forms is made sufficiently great so that
a substantial proportion of the subjects do not finish in the time
allowed.

*The author gratefully acknowledges the suggestions and criticisms of Dr.
Harold Gulliksen, Research Adviser at the Educational Testing Service.

†For definition of Δ, see footnote, p. 301.

When final forms are prepared, items may be selected either 
from a single experimental booklet or from a number of different 
try-out tests. In either event questions may be raised about the ef- 
fects upon the item-analysis data of changing the placement of the 
item and of altering the speededness of the material. Even if all
subjects attempted every item, would the p value, the Δ index, and
biserial r of an item be the same when this item occurs early in a test
as when it occurs late? How do values of these indices for items
toward the end of a test for a group working under very speeded
conditions compare with those obtained for a comparable group working
under power conditions? To secure experimentally derived answers 
to these questions, a research project sponsored by the College En- 
trance Examination Board was carried out at the Educational Test- 
ing Service. 

A verbal aptitude and a mathematical aptitude test were con- 
structed.* The verbal test consisted of six-choice items, while the 
mathematics test contained five-choice items. Each test contained 
three sets of items. Extensive use was made of item-analysis data in 
making the first and third sets parallel, with a grading of difficulty 
from easy to hard. The middle set in each test consisted of items of 
medium difficulty (p values ranged from about .35 to about .65). 
Two forms of each test were prepared using the same items, with the 
set occurring first in one form appearing last in the other, and vice 
versa. The verbal test consisted of 85 antonyms items, with sets one 
and three having 30 items each and the middle set, 25. The mathe- 
matics test contained 36 items, with 12 items in each of the three sets. 


*The preparation of these tests and the initial planning of the project were
carried out by Dr. William B. Michael of San Jose State College (at that time a
member of the Research Department of the Educational Testing Service). The
present writer assisted in planning the administration of the tests, and is
responsible for the analysis of the data and for this report.


Administration of the Experimental Tests


These tests were administered at the public high school in Scarsdale,
New York, in June, 1948, to all students in the junior and senior
classes.* Four groups of students were formed from these classes
by assigning to a particular group every fourth person whose name
appeared on alphabetical lists of junior boys, junior girls, senior boys,
and senior girls. As a check on the comparability of the resultant
four groups in verbal and in mathematical ability, three scores in the
verbal area and one in the quantitative area were selected from the
extensive standardized test data secured regularly in the Scarsdale
school for each student. These were (1) the vocabulary
score, (2) the speed of comprehension score, and (3) the level of
comprehension score on the Cooperative Reading Comprehension
Test C-2, Higher Level, Form S, and (4) the Q-score from the American
Council on Education Psychological Examination. These four
scores were available for almost every student participating in the 
study. By reference to Table 1 it can be seen that the groups were 
very well balanced. 
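
The assignment rule can be sketched as follows. This is an assumed reading, since the report does not say where on each list the rotation began: names are dealt from each alphabetical list into four groups in turn, so that every fourth name falls in the same group. The function name and the sample names are hypothetical.

```python
def assign_groups(alphabetical_names, n_groups=4):
    """Deal names from an alphabetical list into n_groups in rotation."""
    groups = [[] for _ in range(n_groups)]
    for position, name in enumerate(alphabetical_names):
        groups[position % n_groups].append(name)
    return groups


# Hypothetical names, for illustration only.
junior_boys = ["Adams", "Baker", "Clark", "Davis", "Evans",
               "Foster", "Grant", "Hayes", "Irwin"]
groups = assign_groups(junior_boys)
print(groups[0])  # ['Adams', 'Evans', 'Irwin']
```

Applying the same rotation separately to each of the four lists keeps class and sex roughly balanced across the groups, which is consistent with the close agreement of the group means.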


TABLE 1
Means and Standard Deviations on Three Verbal Measures and One
Quantitative Measure for Four Groups Involved in the Experiment

Test                 Statistic       Group 1   Group 2   Group 3   Group 4

1. Vocabulary        Mean              60.1      60.4      60.4      60.6
                     Std. Dev.         10.2      10.3      10.1       9.0
                     No. of Cases        98        93        92        95

2. Speed of          Mean              63.2      63.2      64.2      64.1
   Comprehension     Std. Dev.         12.5      12.8  [illegible]    9.2
                     No. of Cases        98        93        92        95

3. Level of          Mean              59.8      58.9      60.7      60.4
   Comprehension     Std. Dev.         10.7       9.7      11.1       7.3
                     No. of Cases        98        93        92        95

4. A.C.E.            Mean              48.3      49.8      48.9      50.4
   Q-score           Std. Dev.          9.8       9.2       9.7       9.1
                     No. of Cases        98        89        86        92





One experimental group was then given Form 1 of each test,
while a second experimental group took Form 2, with both groups
being given adequate time for all subjects to attempt every item (50
minutes for the verbal and 55 minutes for the mathematics forms).
At other testing sessions, a third group took Form 1 and a fourth
group took Form 2 under a time limit so set that most students
attempted all of the items in the first set, while very few tried all the
items in the test. (The time limits were 15 minutes for the verbal
and 20 minutes for the mathematics forms.) At the end of this short
time limit, which will henceforth be called the speed condition, the
red pencils with which the students had been working were collected,
and blue pencils were distributed. These groups then continued
work until their total time equaled that for the first two groups, and
were instructed not only to mark responses to items previously
not reached but also to change any previously marked response if
they so desired. A change of response was indicated by marking a
large X through the original response on the answer sheet, and then
making a new response. The color of the X depended upon whether
the response was changed before or after the change to the power
condition, at which time the pencils were changed. The plan
followed in giving the tests is summarized in Table 2.

*The cooperation of Mr. Lester W. Nelson, Principal, and of a number of
members of his faculty is gratefully acknowledged.


TABLE 2
Testing Arrangements

      First Day: Verbal                        Second Day: Mathematics
Group  Test    Condition                 Group  Test    Condition
  1    Form 1  Power                       1    Form 1  Speed followed by power
  2    Form 2  Power                       4    Form 2  Speed followed by power
  3    Form 1  Speed followed by power     2    Form 1  Power
  4    Form 2  Speed followed by power     3    Form 2  Power





For tests given under power conditions only, part scores were 
obtained on the three sets of items. Statistical measures derived from 
these scores and the total scores are given in Tables 3 and 4. 


TABLE 3
Means, Standard Deviations, and Intercorrelations of Part and Total Scores on
the Verbal Tests, Power Condition

Form 1 (Group 1, N = 100)
           Part 1   Part 2   Part 3   Total
Part 1               .89      .91      .96
Part 2      .89               .90      .96
Part 3      .91      .90               .97
Total       .96      .96      .97
Mean      15.20    12.25    14.54    41.99
σ          6.15     6.18     6.67    18.35

Form 2 (Group 2, N = 93)
           Part 1   Part 2   Part 3   Total
Part 1               .87      .91      .96
Part 2      .87               .88      .95
Part 3      .91      .88               .97
Total       .96      .95      .97
Mean      15.39    12.47    14.09    41.95
σ          6.21     6.13     6.70    18.32











TABLE 4
Means, Standard Deviations, and Intercorrelations of Part and Total Scores on
the Mathematics Tests, Power Condition

Form 1 (Group 2, N = 93)
           Part 1   Part 2   Part 3   Total
Part 1               .76      .78      .91
Part 2      .76               .83      .94
Part 3      .78      .83               .93
Total       .91      .94      .93
Mean       6.73     6.52     6.52    19.76
σ          3.11     3.35     2.91     8.69

Form 2 (Group 3, N = 93)
           Part 1   Part 2   Part 3   Total
Part 1               .80      .75      .90
Part 2      .80               .84      .95
Part 3      .75      .84               .94
Total       .90      .95      .94
Mean       6.43     5.87     6.27    18.57
σ          2.85     3.40     3.60  [illegible]





Five scores were obtained for each form taken first under the 
speed and then under the power condition: (1) the number of items 
attempted under speed, (2) the number right under speed, (3) the 
number right under power, (4) the number of responses made under 
speed that were changed under power, and (5) the gain (or loss) 
due to these changes. Statistics derived from these scores are given 
in Tables 5 and 6. The data for the two groups taking the verbal 
forms are more consistent than those for the groups taking the 
mathematics forms, especially the data relating to the number of 
items attempted. 

The very high correlations observed between the rights scores
under the speed and power conditions for the verbal forms, and the
high correlations between the speed and power scores for the
mathematics forms, were of sufficient magnitude to indicate a
considerable degree of relationship between the speed and power scores.
Indeed, the size of the correlation coefficients may be a source of
surprise to some workers in the field of mental tests. It is also
apparent from these tables that the correlation between number attempted
under speed and number right under power is distinctly higher for
the verbal than for the mathematics forms.


TABLE 5
Means, Standard Deviations, and Intercorrelations of Various Scores on
Verbal Tests, Speed-Power Conditions

Scores: 1 = number attempted under speed; 2 = number right under speed;
3 = number right under power; 4 = number of changes; 5 = gain due to changes.

Form 1 (Group 3, N = 93)
Score        1        2        3        4        5
  1                  .73      .54      .35      .19
  2         .73               .94      .24      .21
  3         .54      .94               .22      .29
  4         .35      .24      .22               .61
  5         .19      .21      .29      .61
Mean      57.81    27.87    40.94     2.42     0.85
σ         12.56    17.88    19.17     2.87     1.70

Form 2 (Group 4, N = 96)
Score        1        2        3        4        5
  1                  .62      .35      .41      .37
  2         .62               .91      .37      .39
  3         .35      .91               .25      .38
  4         .41      .37      .25               .53
  5         .37      .39      .38      .53
Mean      56.86    27.89    42.43     2.30     0.64
σ         11.04    14.82    16.87     2.78     1.59


TABLE 6
Means, Standard Deviations, and Intercorrelations of Various Scores on
Mathematics Tests, Speed-Power Conditions

Scores: 1 = number attempted under speed; 2 = number right under speed;
3 = number right under power; 4 = number of changes; 5 = gain due to changes.

Form 1 (Group 1, N = 100)
Score        1        2        3        4        5
  1                  .38     -.03      .41      .24
  2         .38               .81      .16      .14
  3        -.03      .81              -.02      .09
  4         .41      .16     -.02               .79
  5         .24      .14      .09      .79
Mean      21.50     9.83    18.78     0.65  [illegible]
σ          7.00     4.80     7.82     1.33     1.03

Form 2 (Group 4, N = 96)
Score        1        2        3        4        5
  1                  .63      .25      .61      .58
  2         .63               .82      .38      .52
  3         .25      .82               .17      .35
  4         .61      .38      .17               .70
  5         .58      .52      .35      .70
Mean      20.67    11.01    20.73     0.79     0.45
σ          6.31     6.47     8.17     1.46     1.06


Drop-out 


A further type of information is given by the data on the “drop-out”
under the speed condition. To permit comparisons, each test
was broken down by sixths of the total number of questions. Table 7
shows what percentage of the initial group for each test form
attempted the last item in each one of these subsets. Evidently the
speededness of the mathematics forms was somewhat greater than
that for the verbal forms. Considering the drop-out shown for each
form and recalling the arrangement of the items according to
difficulty level, one can also conclude that the average difficulty of the
items not reached under the speed condition was greater than that of
the items reached.


TABLE 7
Per Cent of Students Reaching Various Points in the Tests

Test and Form                 Ordinal Number of Item
                          14     28     42     56     70     85
Verbal Form 1            100    100     89     58     19      6
Verbal Form 2            100    100     94     54     13      8

                           6     12     18     24     30     36
Mathematics Form 1        99     96     69     32     14      8
Mathematics Form 2       100     96     57     30     13      1


Changes Made in Speed Responses under the Power Condition 


Tables 5 and 6 indicate that those who more often change their 
responses in any way when given added time tend to be individuals 
who work fast, and who get high scores under the speed condition. 
The changes tend to better the scores for the faster individuals more 
than for the slow ones, and tend to be made in the right direction 
more often by high-scoring than by low-scoring students. It is inter- 
esting to note the appreciable correlations, ranging from .53 to .79, 
between number of changes and gain due to changes. 

To permit study of how the responses made under the speed con- 
dition were changed under the power condition, item cards were 
punched with both sets of responses. The tabulator was then used 
to secure, for each item, a scatter plot containing as cell frequencies 
the numbers of persons with any given combination of speed and 
power responses (right, wrong, omitted, or not reached). These scat- 
ter plots were summed over items and the sums are given in Tables 
8-11. Each cell also contains in parentheses the proportion of the 
row total which corresponds to the cell frequency. 
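
The parenthesized proportions can be recomputed from the cell frequencies. The sketch below uses the Verbal Form 1 frequencies reported in Table 8; the function name `row_proportions` and the dictionary layout are illustrative assumptions.

```python
def row_proportions(table):
    """Divide each cell frequency by its row total, rounded to three places.

    `table` maps a speed response to its [omit, wrong, right] power counts.
    """
    result = {}
    for speed_response, counts in table.items():
        total = sum(counts)
        result[speed_response] = [round(c / total, 3) for c in counts]
    return result


# Cell frequencies for Verbal Form 1 (power responses: omit, wrong, right).
verbal_form_1 = {
    "right": [2, 52, 2534],
    "wrong": [10, 1701, 131],
    "omit": [564, 208, 174],
    "not reached": [404, 1161, 964],
}

props = row_proportions(verbal_form_1)
for speed_response, values in props.items():
    print(speed_response, values)
```

Independent rounding of each cell can differ from a printed value by one unit in the third decimal where the published row was adjusted to total 1.000, as in the wrong-to-wrong cell here.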


TABLE 8
Changes in Speed Responses Made Under Power Condition, Verbal Form 1
(85 item test; N = 93)

                               POWER RESPONSES
SPEED RESPONSES       Omit       Wrong       Right       Total
Right                    2          52        2534        2588
                    (.001)      (.020)      (.979)     (1.000)
Wrong                   10        1701         131        1842
                    (.005)      (.924)      (.071)     (1.000)
Omit                   564         208         174         946
                    (.596)      (.220)      (.184)     (1.000)
Not Reached            404        1161         964        2529
                    (.160)      (.459)      (.381)     (1.000)
Total                  980        3122        3803        7905
                    (.124)      (.395)      (.481)     (1.000)

















TABLE 9

Changes in Speed Responses Made Under Power Condition, Verbal Form 2
(85 item test; N = 96)

                               POWER RESPONSES
SPEED RESPONSES            Omit      Wrong      Right      Total

Right                                   44       2640       2684
                                    (.016)     (.984)    (1.000)
Wrong                         7       1813        107       1927
                         (.003)     (.941)     (.056)    (1.000)
Omit                        427        239        182        848
                         (.503)     (.282)     (.215)    (1.000)
Not Reached                 403       1158       1140       2701
                         (.149)     (.429)     (.422)    (1.000)

Total                       837       3254       4069       8160
                         (.102)     (.399)     (.499)    (1.000)





TABLE 10

Changes in Speed Responses Made Under Power Condition, Mathematics Form 1
(36 item test; N = 100)

                               POWER RESPONSES
SPEED RESPONSES            Omit      Wrong      Right      Total

Right                                    6        977        983
                                    (.006)     (.994)    (1.000)
Wrong                         2        622         48        672
                         (.003)     (.926)     (.071)    (1.000)
Omit                        292         72        132        496
                         (.589)     (.145)     (.266)    (1.000)
Not Reached                 271        458        720       1449
                         (.187)     (.316)     (.497)    (1.000)

Total                       565       1158       1877       3600
                         (.157)     (.322)     (.521)    (1.000)




























TABLE 11

Changes in Speed Responses Made Under Power Condition, Mathematics Form 2
(36 item test; N = 96)

                               POWER RESPONSES
SPEED RESPONSES            Omit      Wrong      Right      Total

Right                         1          6       1052       1059
                         (.001)     (.006)     (.993)    (1.000)
Wrong                         2        544         54        600
                         (.003)     (.907)     (.090)    (1.000)
Omit                        200         51         74        325
                         (.615)     (.157)     (.228)    (1.000)
Not Reached                 298        365        809       1472
                         (.202)     (.248)     (.550)    (1.000)

Total                       501        966       1989       3456
                         (.145)     (.279)     (.576)    (1.000)





Answers to a number of questions may be found in these data. 
How does the accuracy under speed compare with that under power? 
Using as an index of accuracy the proportion of the total number of 
items attempted (number right plus number wrong plus omits) 
which were right, for Verbal Form 1 the proportions were .48 for 
the speed and .48 for the power condition, while for Verbal Form 2 
the corresponding proportions were .49 and .50. On the other hand, 
the pairs of proportions were .46 and .52 for Mathematics Form 1 
and .53 and .58 for Mathematics Form 2. Hence the observed accu- 
racy for the power condition was about the same as that for the 
speed condition for the verbal forms, while for the mathematics 
forms the accuracy for the power condition was somewhat greater 
than for the speed condition. These results are especially of interest 
considering that with increasing difficulty of material there would 
normally be a decrease in observed accuracy. Thus the additional 
time helped accuracy for both types of material. 
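The accuracy index defined above can be recomputed from the marginal totals of Tables 8-11. The Python sketch below does so; the dictionary layout and function names are illustrative assumptions, while the counts are those printed in the tables.

```python
# Speed condition: row totals (right, wrong, omit) from Tables 8-11.
# Power condition: (right, all items); the tables show no "not reached"
# category under power, so every item counts as attempted there.
TABLES = {
    "Verbal Form 1":      {"speed": (2588, 1842, 946), "power": (3803, 7905)},
    "Verbal Form 2":      {"speed": (2684, 1927, 848), "power": (4069, 8160)},
    "Mathematics Form 1": {"speed": (983, 672, 496),   "power": (1877, 3600)},
    "Mathematics Form 2": {"speed": (1059, 600, 325),  "power": (1989, 3456)},
}


def speed_accuracy(right, wrong, omit):
    """Proportion right among items attempted (right + wrong + omit) under speed."""
    return right / (right + wrong + omit)


def power_accuracy(right, total):
    """Proportion right among all items under the power condition."""
    return right / total


for form, counts in TABLES.items():
    s = speed_accuracy(*counts["speed"])
    p = power_accuracy(*counts["power"])
    print(f"{form}: speed {s:.2f}, power {p:.2f}")
```

Rounded to two places, these ratios agree with the pairs of proportions quoted in the text.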

The proportions given under the cell frequencies in Tables 8-11 
are helpful in determining what happened when the students, upon 
being allowed more time, changed marked responses already made or 
marked responses to items previously omitted or not reached. Evi- 
dently, right responses were very rarely changed. For the wrong re- 
sponses, the tables show only the changes from wrong responses to 
omits or to right responses. Data basic to Tables 5 and 6 reveal that 
the most frequent changes in wrong responses were to other wrong 
responses. (These changes numbered 161, 170, 57, and 67 for Verbal
Forms 1 and 2 and Mathematics Forms 1 and 2, respectively.)
Wrong responses were less frequently changed to right responses. 
Less than half of the omits for the verbal forms were later marked, 
and, when marked, more often were made wrong responses than 
right. The number answered correctly was greater than would have 
been the case if the marking had been done on a completely random 
basis, however. For the mathematics forms only about two-fifths of 
the omitted items were given marked responses under the longer time 
limit, but these were distinctly more often right than wrong. While 
these second attempts at omitted items were more successful for 
mathematics than for verbal material, it must be noted that the over- 
all difficulty of the mathematics tests was somewhat less than that 
for the verbal tests under the power condition. 

Analysis of the power responses to items not reached under the 
speed condition shows that for each of the four forms there was a 
higher proportion of omits on these items than there had been on the 
items previously attempted. The number of right responses made 
under the longer time limit to items not previously reached, when
compared to the total number of these items, shows that for the
verbal forms the accuracy on these items (.38 for Form 1 and .42 for
Form 2) was less than that for items tried under the short time limit 
(.48 and .49, respectively). For the mathematics forms, the accuracy 
on these previously-not-reached items (.50 for Form 1 and .55 for 
Form 2) was greater than that for items tried under the short time 
limit (.46 and .52, respectively). 
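The comparisons in this paragraph can be checked against the "Not Reached" rows of Tables 8-11. The sketch below is a hedged illustration in Python: the counts come from the tables, but the variable and function names are assumptions, and the "previously attempted" figures are formed here by summing the Right, Wrong, and Omit rows.

```python
# Power responses (omit, wrong, right) to items NOT reached under speed.
NOT_REACHED = {
    "Verbal Form 1":      (404, 1161, 964),
    "Verbal Form 2":      (403, 1158, 1140),
    "Mathematics Form 1": (271, 458, 720),
    "Mathematics Form 2": (298, 365, 809),
}

# Items attempted under speed: (power omits, total), summed over the
# Right, Wrong, and Omit rows; a blank cell is counted as zero.
PREVIOUSLY_ATTEMPTED = {
    "Verbal Form 1":      (2 + 10 + 564, 2588 + 1842 + 946),
    "Verbal Form 2":      (0 + 7 + 427, 2684 + 1927 + 848),
    "Mathematics Form 1": (0 + 2 + 292, 983 + 672 + 496),
    "Mathematics Form 2": (1 + 2 + 200, 1059 + 600 + 325),
}


def accuracy(omit, wrong, right):
    """Proportion right among all items in the row."""
    return right / (omit + wrong + right)


def omit_rate(omit, total):
    """Proportion of items left omitted under the power condition."""
    return omit / total


for form, row in NOT_REACHED.items():
    late_omits = omit_rate(row[0], sum(row))
    early_omits = omit_rate(*PREVIOUSLY_ATTEMPTED[form])
    print(f"{form}: accuracy {accuracy(*row):.2f}, "
          f"omits {late_omits:.3f} vs {early_omits:.3f}")
```

For each form the omit rate on the not-reached items exceeds that on the previously attempted items, as stated above.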

That the over-all accuracy was about the same under the speed 
and power conditions for the verbal material seems perplexing in 
view of the statements just made. The explanation apparently is to 
be found in the fact that of the changes made in previous wrong re- 
sponses to other types of response (i.e., right or omit), most were to 
right responses. 

From the analysis of Tables 7-11, then, a somewhat different 
test behavior is indicated for the mathematics material than for the 
verbal items. Even when the slightly greater over-all difficulty of 
the verbal forms is allowed for, it still seems likely that the greater 
amount of time helps accuracy on mathematics items more than on 
verbal items. One explanation which might be advanced is that the 
type of task involved was different in the two instances. The verbal 
antonyms were probably either known at once or were not known — 
answers were largely a matter of recognition. On the other hand, the 
mathematics items usually involved problem-solving, and were fig- 
ured out over the extra period of time. 






















Item-Analysis Data 


Item analyses were carried out for each test form and each group 
of students under each time condition.* Three measures were computed
for each question in each analysis: p, the proportion right of
those attempting the item; Δ, another index of item difficulty;† and
r_bis, the biserial coefficient of correlation between performance on the
item (for those attempting it) and total test score.
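As a rough illustration of the three measures, the following Python sketch computes p, Δ, and a biserial r for a single item. It follows the footnoted definition of Δ, but several details are assumptions not stated in the text: the biserial is computed with the usual textbook formula (M_p − M_t)/σ_t · p/y, population standard deviations are used, criterion scores are taken to be already on the scale with mean 13 and σ of 4, and all names and sample data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

STD_NORMAL = NormalDist()


def item_statistics(scores, responses):
    """Return (p, delta, r_bis) for one item.

    scores    -- criterion scores, assumed already on the 13/4 scale
    responses -- True (right), False (wrong or omitted), None (not reached)
    """
    attempted = [(s, r) for s, r in zip(scores, responses) if r is not None]
    att_scores = [s for s, _ in attempted]
    passers = [s for s, r in attempted if r]
    p = len(passers) / len(attempted)
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    x = STD_NORMAL.inv_cdf(1.0 - p)   # positive when p < .50, negative when p > .50
    m, sd = mean(att_scores), pstdev(att_scores)
    delta = m + sd * x
    y = STD_NORMAL.pdf(x)             # normal ordinate at the cut point
    r_bis = (mean(passers) - m) / sd * p / y
    return p, delta, r_bis


# Hypothetical item: eight examinees, one of whom did not reach the item.
scores = [9, 11, 13, 15, 17, 13, 10, 16]
responses = [False, False, True, True, True, False, None, True]
p, delta, r_bis = item_statistics(scores, responses)
print(round(p, 3), round(delta, 2), round(r_bis, 2))
```

Persons not reaching the item are excluded from all three statistics, while omits count as attempts that were not right.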


In order to determine the effects of item placement and of test 
timing on these item statistics, a series of scatter plots was made for 
each statistic. Three types of plot were made: power versus power, 
power versus speed, and speed versus speed. In the first type, power 
data for one group of students were plotted against similar data for 
a second group. In some of these plots, power data for the same test 
form given to two different groups were compared, while in others 
of the plots, power data for two different test forms were involved. 
It is to be noted that power values as well as speed values were com- 
puted for the different indices for the groups taking the tests first 
under speed and then under the power condition. In the second type 
of plot, data based on performance under speed were compared with 


*While the limited supply lasts, the author will send a copy of the item- 
analysis data to any interested reader who requests one. 

†Δ is defined by the relation

$$\Delta = M_c + \sigma_c x,$$

where $M_c$ is the mean criterion score (raw test score converted linearly to a
scale with mean of 13 and σ of 4) of persons attempting the item, $\sigma_c$ is the
standard deviation of these scores, and x is the abscissal distance in sigma units
from the mean of a unit normal distribution to the point whose ordinate cuts off
an area corresponding to the proportion of those attempting the item who get it
right. The sign of x is taken to be positive when p < .50, negative when p > .50.
For both p and Δ, number attempting is defined to include number right, number
wrong, and number omitting, but to exclude the number of persons not reaching
the item. In terms of raw scores, delta can also be expressed as follows:
$$\Delta = 13 + \frac{4}{\sigma_t}\left[(M_i - M_t) + \sigma_i x\right],$$

where

$M_t$ is the mean of the raw test scores for the entire group,

$M_i$ is the mean of the raw test scores for those attempting the item,

$\sigma_t$ is the standard deviation of the raw test scores for the entire group,

$\sigma_i$ is the standard deviation of the raw test scores for those attempting
the item,

with x defined as before.
















data based on performance under power. Since in some instances a 
group took a test form first under the speeded condition and then 
under the power condition, some of these plots involved data for the 
same group on both axes, while other plots were based on data for 
two different groups. In the third type of graph, speed data for one 
group were plotted against speed data for a second group. Because 
of the design of the experiment, two test forms were involved in 
these plots. In order more easily to identify any trend that might be 
present in the data, the points were plotted using symbols which in- 
dicate which set of items in the test named along the horizontal axis 
included a particular item. In the inspection of the power versus 
power plots, it was found helpful to make allowances for any difference
between test means while looking for trends.

While some 85 graphs were plotted from the item-analysis data 
during the course of the study, space will permit the exhibiting of 
only a small number in this report. These are Figures 1-8. In the 
figures small triangles represent values for items of the first set in 
the form noted along the X-axis, solid circles represent items in the 
middle set, and small crosses represent the third set of items in this 
form. Thus when Form 1 data are plotted along the X-axis and Form 
2 data along the Y-axis, a triangle locates the values for an item 
appearing early in Form 1 and late in Form 2. No point is based on 
fewer than 15 cases. To emphasize the effects of speed, points based 
on 15-25 cases are ringed with circles. 


Results: Item Difficulty Index p 


For the power versus power plots in which test form (and hence 
item placement) was the same for the two groups of students, there 
was no consistent trend observed in the data for either the verbal 
or the mathematics forms, the points falling in a diagonal band. This 
shows that the groups were similar on these tests, and provides an 
initial check on the data before other comparisons are made. 

In the power versus power plots in which data from one form 
are compared with data from the second form, it was noted that an
item in the verbal forms tended to have a somewhat higher p when 
it occurred early than when it came late in the test. (It is recalled 
that the first and third sets of items in Form 1 were interchanged to 
create Form 2). However, no similar trend was consistently noted in 
the various plots involving mathematics test data. 

The change in the difficulty index p due to change in the position 
of the item in the verbal tests was rather small (being of the order 













of .05), and in the author’s opinion may probably be neglected safely 
in the usual applications of item-analysis data in test construction. 
The phenomenon may perhaps be attributed to greater carelessness 
towards the end of the test, resulting from increasing fatigue and 
mounting pressure to finish the test. 

In the power versus speed plots for the verbal forms, the power- 
condition p’s were generally higher than the speed-condition p’s for 
items appearing early in the test, while for items appearing late un- 
der the speed condition, the speed p’s generally exceeded the power 
p’s. High drop-out occurred before the speed p’s exceeded the power 
p’s. In the power versus speed plots for the mathematics forms, the 
power p’s tended to be somewhat higher in general than the speed 
p’s, although there were many inconsistencies to this relationship. 
There was no clear-cut tendency for the items coming late in the 
mathematics forms to have higher observed p’s for the speed than 
for the power condition, as had been the case for the verbal forms. 

We might therefore say that when all students attempt an item 
in a test similar to one of those in the present study, more of the 
students will mark the correct response if they are given plenty of 
time than if they are pressed for time. In regard to the higher p’s 
for the verbal items appearing late under speed, it is to be noted that 
Table 5 shows that for the verbal forms there are substantial relationships
between number of items attempted (i.e., rate of work) on
the one hand and number right under speed and number right under
power on the other hand. In Table 6 it can be seen that the corresponding
relationships for the mathematics forms are lower; in fact,
one of the correlations is slightly (nonsignificantly) negative.

The progressive drop-out that occurred under the speed con- 
dition, then, meant that not only was the group who reached and at- 
tempted an item toward the end of a verbal form a faster group, but 
also that it was a more able group. The data on the p values of items 
appearing near the end of a verbal test reflect this selective drop- 
out, with the speed p’s for these items exceeding the power p’s. Since 
try-out booklets are often administered under somewhat speeded con- 
ditions, the test constructor must consequently beware of p values 
for items toward the end of the test when derived under these cir- 
cumstances. In the present study, a drop-out of about half or more 
of the cases was required before the speed p’s were consistently and 
appreciably higher than the power p’s for the same verbal items. 

It was possible to plot only one speed versus speed graph for 
the verbal forms and one for the mathematics forms. Results in 
both cases were consistent with those presented above for the power 
















versus speed comparisons. With Parts 1 and 3 interchanged in the 
two forms yielding the data, the set of items coming first for one 
group came last for the other, and vice versa. For the verbal forms, 
the items in a given set in general had higher p’s when they occurred 
late than when they occurred early. (For each verbal form it is to be 
remembered that drop-out by the beginning of the third set of items 
was quite great, namely 41% for Form 1 and 44% for Form 2.) For 
the mathematics forms, no similar trend was observed, but a greater 
variability about the 45° regression line was noted throughout the 
plot. 


Results: Difficulty Index Δ


The difficulty index Δ involves a correction which was intended
to yield difficulty values for items on which high drop-out had occurred
similar to the values these items would have had if they had
come early in the test. One of the questions raised in the study was:
How successful is this correction?

For the power versus power plots in which test form (and consequently
item placement) was the same for the two groups of students,
there was no consistent trend in the Δ values for either the
verbal or the mathematics forms, the points falling in a diagonal
band. This is evidence that the groups were similar, and of course
is a consequence of the previous findings on p for the corresponding
plots, since Δ is merely a re-scaling of p when all persons attempt
every item.

In the power versus power plots in which Δ values from one
form given to one group were involved along one axis and Δ values
from the second form given to another group were involved along
the other axis, the Δ value for an item in the verbal forms in general
was slightly lower when the item came early than when it came late
in the test. (This finding is consistent with that for p, since the lower
the Δ, the easier the item is indicated to be.) Again, as was true
for p, no similar position effect was clear among the mathematics
data on Δ. The position effect on item Δ noted for verbal items was
small (being of the order of .5 Δ) and can probably be disregarded
safely in ordinary applications of Δ in test construction.
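The re-scaling of p when all persons attempt every item can be illustrated with a brief Python sketch. It assumes that in this case the attempting group has the full-group criterion mean of 13 and σ of 4, so that the index reduces to 13 + 4x; the function name is hypothetical.

```python
from statistics import NormalDist


def delta_all_attempt(p, mean=13.0, sd=4.0):
    """Delta for an item attempted by everyone: a normal-curve re-scaling of p."""
    x = NormalDist().inv_cdf(1.0 - p)   # positive when p < .50
    return mean + sd * x


# A drop of .05 in p near the middle of the scale, the order of magnitude
# reported earlier for the position effect in the verbal forms.
shift = delta_all_attempt(0.45) - delta_all_attempt(0.50)
print(round(delta_all_attempt(0.50), 2), round(shift, 2))
```

Under these assumptions a change of .05 in p near p = .50 corresponds to a change of about .5 on the Δ scale, so the two reported position effects are of matching size.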

In the power versus speed plots for both the verbal and the
mathematics forms, the speed-condition Δ values exceeded the power-condition
Δ values in a great preponderance of the pairs compared,
in each of the plots. Furthermore, the speed Δ values for items toward
the end of the test, reached by only a small proportion of the













students under the speed condition, exceeded the corresponding power
Δ values considerably more than was the case for earlier-placed
items. In short, the farther along the item occurred under the speed
condition, the more the speed Δ value tended to be greater than the
power Δ value.

Only one speed versus speed graph could be plotted for the ver- 
bal data, and one for the mathematics data. Results in both instances 
were consistent with those just presented for the power versus speed 
comparisons. In each of these two graphs, two test forms (and two 
groups) were involved. Item placement differed for the two groups, 
an item appearing early for one group coming late in the test for the 
other. While the items sharing a similar position for the two groups 
(i.e., the middle set) had fairly similar Δ values for the two groups,
large differences, having a distinct trend to grow greater toward the
end of the set, were observed between corresponding Δ values for the
items in the other two sets. Items in the set appearing last for one
group had consistently higher Δ values for that group than when
they occurred in the first set for the second group. The closer to the
end of the test the item appeared, the more its Δ value for the group
taking this form exceeded its Δ value for the group taking the other
form, in which the item, of course, came much earlier.


The findings presented above for Δ which are based on comparisons
involving speed-condition Δ values can be seen to be inconsistent
with those presented earlier for p. For verbal items it was shown
that drop-out due to speeding led to p values higher than corresponding
power values, while for mathematics items no consistent effect was
observed. However, speed Δ values were found with few exceptions
to exceed power Δ values for the same items. These data on Δ taken
at their face value would indicate that the groups reaching items under
speed find these much harder than do groups who are given plenty
of time. As progressive drop-out occurs, the groups are indicated
by Δ to become less and less able to cope with the items. Clearly the
results on Δ in which speed is involved are contradictory to other
available evidence.


For the data to have shown the correction involved in Δ to have
been of the proper amount, the speed and power Δ values should
have plotted essentially linearly for a given test form. Actually the
speed Δ values were considerably higher (and thus indicated greater
difficulty) than the power Δ values for items toward the end of the
test. When a set of p values expressed on the Δ scale without the
correction was compared with the usual Δ values, it was very clear















































that the so-called correction actually caused the Δ values to be farther
off on the “hard” side than the (transformed) p’s were on the
“easy” side, for these late-appearing items.
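A small numerical sketch in Python may make the direction of this over-correction concrete. The figures are purely hypothetical and are not taken from the study: an item with power p of .50, a speed p of .55 among the select group reaching it, and a criterion mean of 16 (σ of 4) for that group.

```python
from statistics import NormalDist


def delta(p, group_mean, group_sd):
    """Delta = mean + sd * x for the group attempting the item (names hypothetical)."""
    return group_mean + group_sd * NormalDist().inv_cdf(1.0 - p)


# Hypothetical values, chosen only to illustrate the direction of the effect.
power_delta = delta(0.50, 13.0, 4.0)   # everyone attempts the item
uncorrected = delta(0.55, 13.0, 4.0)   # speed p placed on the scale, no correction
corrected = delta(0.55, 16.0, 4.0)     # speed p with the attempters' higher mean

print(f"power {power_delta:.2f}, uncorrected {uncorrected:.2f}, corrected {corrected:.2f}")
print(f"easy-side error {power_delta - uncorrected:.2f}, "
      f"hard-side error {corrected - power_delta:.2f}")
```

With these assumed numbers the uncorrected value falls somewhat on the easy side of the power value, while the corrected value falls considerably farther on the hard side, which is the pattern described above.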

It is of course to be noted that it is the correction factor in Δ
which is shown to be at fault. The re-scaling feature of Δ, involving
a normal-curve transformation from p, leads to certain desirable
characteristics in the difficulty indices. This type of transformation
used in the original development of Δ (see 1, p. 356) has more recently
been used by Davis in his monograph (3) on item-analysis
data and their use. It seems proper to emphasize that the Δ index
remains unquestionably a useful measure of item difficulty when all
or most candidates attempt every item.*


Results: Item-Test Biserial Correlations 


For each group tested only under the power condition there was 
computed one set of item-test biserial coefficients of correlation. On 
the other hand, for the groups tested first under the speed condition 
and then under the power condition, four sets of biserial r’s were
computed. For each item, both a speed response and a power re- 
sponse were available. Each of these was correlated with both the 
speed score (number right under speed) and the power score (num- 
ber right) on the total test. 

In the sets of graphs in which power response-power score bi- 
serials were plotted on both axes, no clearly consistent trend was 
found in the biserials, either for the plots in which test form was the 
same along both axes or for the plots in which two forms (and hence 
different item placements) were involved on the two axes. With test 
form constant, when the power response-power score biserials were 
plotted against speed response-speed score biserials, it was noted 
that whereas for items early in the test the power-power values in 
general tended to exceed the speed-speed values, for items late in the 
test (and hence on which drop-out was quite high) the speed-speed 
values tended to exceed the power-power values. These findings held 
for both the verbal and the mathematics forms. 

These data on the biserials present a challenging problem in interpretation.
The phenomenon just described of the r’s for items
near the end of a speeded test being considerably higher than those 
for early items is, of course, one that has often been observed. (See, 

*Delta is frequently used for equating CEEB tests which have sufficient
items in common, or in adjusting item difficulties obtained for different populations.
(This type of change in item difficulty was not touched upon in the present
study.)













for example, references 2 and 6.) In the present study, various spe- 
cial biserials were obtained, and the plotting of graphs contrasting 
different types of these was carried out in order to secure further in- 
formation about this phenomenon. 


TABLE 12
Trends Observed in Plots of Item-Test Biserial Correlations
When Values Derived from Various Timing Conditions,
Item Scores, and Test Scores Are Compared

           Type of Biserial Plotted
        Set A               Set B
    Item   Criterion    Item   Criterion
    Score  Score        Score  Score       Finding

1.  Power  Power        Power  Power       No trend observed.

2.  Power  Power        Speed  Speed       Power-power generally higher than
                                           speed-speed values for all but late
                                           items; speed-speed generally higher than
                                           power-power values for late items.

3.  Speed  Speed        Speed  Power       No trend; very good agreement between
                                           sets.

4.  Power  Power        Power  Speed       No clear-cut trend observed.

5.  Speed  Speed        Power  Speed       Speed-speed higher than power-speed
                                           values for items late in test.

6.  Power  Power        Speed  Power       Speed-power higher than power-power
                                           values for items late in test.

7.  Speed  Speed        Speed  Speed       Great variability, biserial distinctly
                                           higher for item when near end than
                                           when near beginning of test.





Obviously two aspects of the biserial are to be considered: (1) 
the test score used as the criterion, and (2) behavior on the specific 
test item. The various explanations of why the speed-speed biserials 
exceed the power-power values for items late in the test can probably 
be subsumed under the following two categories: (1) differences be- 
tween the two criteria, and (2) differences between the performance 
on a given item late in the test of the entire group when essentially 
unhurried and the performance of the smaller group reaching and 
attempting this item while pressed for time. Table 12 shows how 
these possibilities were explored by means of comparisons of several 
sets of biserials in a series of graphs. In all of the graphs, test form 
was the same for the two sets of data. The findings hold for both 
the verbal and the mathematics forms. 

First the hypothesis was considered that the difference in
criterion accounts for the fact that biserials of late items are higher
under the speed than under the power condition. In the typical try- 
out situation, in which an appreciable proportion of the testees do 
not complete the test within the time limit imposed, the item-test bi- 
serials approach what we have termed the speed-speed variety; and 
hence the question arises whether use of some criterion other than 
total number right within the time limit would be advisable, as, for 
example, number right on that portion of the items attempted by all 
subjects. In the two sets of plots in which everything but the cri- 
terion was held constant (comparisons 3 and 4 in Table 12), there 
was very good agreement between the two sets of data, with no 
trend observable. These facts give little reason for preferring the 
power score over the speed score as the criterion. 

The next hypothesis to be tested was that the higher biserials 
for items toward the end of a speeded test were due to the fact that 
the behavior of the subjects in the select group attempting these 
items under speed differed from the behavior of the entire group 
working under unhurried conditions. Two sets of graphs yielded 
pertinent information. In one set the criterion was number right
under speed, and the only factor varied was the condition under 
which item responses were made (speed or power). That is, speed 
response-speed score biserials were plotted against power response- 
speed score biserials. (Comparison 5, Table 12). While for early 
items in the test the power-speed values were about the same or 
slightly higher than the speed-speed values, for late items (where 
drop-out was high) the speed-speed values in general exceeded the 
power-speed ones. In the second pair of plots (Comparison 6, Table 
12), the criterion used was test score under the power condition, with 
the condition under which the responses were made again being the 
only variable. For these, the power-power biserials for items early 
in the test were about the same as, or slightly higher than, the speed- 
power values. On the other hand, for items late in the test the 
speed-power biserials tended to exceed the power-power biserials. 

The graphs in which the two plotted sets of biserials involved 
the same criterion with one set involving item performance under 
speed and the other, item performance under power, showed that for 
items late in the test the correlations (for the group attempting the 
item) of the speed responses with the criterion were higher than the 
correlations of the power responses with the criterion, regardless of 
whether the criterion was number right under the speed or number 
right under the power condition. Hence we are led to stating that 
the late item-higher biserial phenomenon observed in the item-analysis
data for speeded tests seems to be the consequence of a difference
between the performance on such an item shown by the group
who reach and attempt this item under the speed condition and the 
performance of the entire group working under less hurried condi- 
tions. 

To explore the matter further, the writer compared for each test 
form the speed criterion-score distributions for the groups attempt- 
ing several late items with the criterion-score distribution for the 
entire group taking the form. First the means and variances were 
considered. If the total scores were perfectly correlated with num- 
ber attempted, the group attempting any item after drop-out begins 
would have a higher mean and lower variance of scores than the 
initial group. While for all forms the means for the groups attempt- 
ing late items did increase as the end of the test was approached, 
the variance did not correspondingly decrease for items in all four 
forms; in fact, this happened only for items in Verbal Form 1. Even 
there, the decrease was slight. For items in the other three forms, 
the criterion-score variance for the group attempting an item grew 
larger, the nearer the item was to the end of the test. 

Next the actual shapes of the distributions were examined. The 
score distributions for the groups attempting the late items were 
much flatter than those for the entire groups, being almost rectangu- 
lar in several instances. When the distributions of the passing and 
failing groups were considered, it was evident that the very high 
r’s for these late items occurred when the criterion-score distribu- 
tions of passers and failers approached a non-overlapping condition. 
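How a flat criterion-score distribution with non-overlapping passers and failers inflates the biserial can be shown with a short Python sketch. The data are hypothetical (a rectangular distribution of scores 1 to 20, split at the middle), and the usual textbook biserial formula, (M_p − M_q)/σ_t · pq/y, is assumed here rather than taken from the text.

```python
from statistics import NormalDist, mean, pstdev


def biserial(pass_scores, fail_scores):
    """Biserial r from the criterion scores of passers and failers."""
    all_scores = list(pass_scores) + list(fail_scores)
    p = len(pass_scores) / len(all_scores)
    q = 1.0 - p
    y = NormalDist().pdf(NormalDist().inv_cdf(p))   # normal ordinate at the p/q cut
    return (mean(pass_scores) - mean(fail_scores)) / pstdev(all_scores) * p * q / y


# Hypothetical rectangular distribution: one examinee at each score from 1 to 20,
# with every passer scoring above every failer.
failers = list(range(1, 11))
passers = list(range(11, 21))
r = biserial(passers, failers)
print(round(r, 3))
```

For this constructed case the computed coefficient comes out slightly above one, because the formula presupposes a normal criterion distribution that the rectangular scores do not satisfy.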

Why are these distributions flat? This question can perhaps 
best be answered by considering what persons attempt late items, 
and why. The evidence of this study indicates that the best persons 
(in terms of scores) attempt the most items, and usually are suc- 
cessful. The even spread of scores down into the low range indicates 
that some persons of low and of average ability also mark late items. 
Two possible explanations are suggested as to why these persons of 
lower ability thus hurry through the test: (a) these persons do not
recognize their own limitations, believing themselves to be more ca- 
pable than they really are, and (b), these persons are test-wise in- 
dividuals who hope to better their scores by attempting a great many 
items. 

It is the writer’s belief, then, that the high biserials for late 
items in a speeded test are a reflection of the inappropriateness of 
this statistic under the circumstances, the situation being somewhat 













akin to that cited by Richardson* in a footnote in Kroll’s article (5, 
pp. 429-430) and by Kelley (4, pp. 376-377). The criterion-score 
mean and variance do not provide sufficient clues as to what is hap- 
pening with late items; the actual shape of the score distribution 
seems to provide a better indication that biserial r is an inappropri- 
ate measure for evaluating items occurring late in a speeded test. 


Summary and Conclusions 


When students were given more time in which to work on tests 
in the verbal and mathematics area after having initially been 
stopped at the end of a time limit so short that very few persons had 
attempted all items, their accuracy was about the same under the 
two time conditions for the verbal material; but was somewhat great- 
er for the mathematics material under the power condition than it 
had been under the preceding speed condition. Right responses un- 
der the speed condition were practically never changed under the 
power condition. Wrong responses were most frequently changed to 
other wrong responses, and less frequently to right responses when 
added time was given, although the absolute frequency of any changes 
at all was low in proportion to the number of speed responses in- 
volved. Items omitted under the speed condition were in a majority 
of instances still omitted when extra time was given. When later 
marked, items originally omitted were more often marked incorrectly 
than correctly for the verbal forms, but were more often marked 
correctly than incorrectly for the mathematics forms. 

For both types of material there was a higher proportion of 
omits under the power condition among items not reached under 
speed than there had been among items attempted under the speed 
condition. Accuracy under the power condition on items not reached 
under the speed condition was slightly less than the over-all speed 
accuracy for the verbal forms and somewhat greater than the over- 
all speed accuracy for the mathematics forms. 

The following statements follow from the analysis of the item 
data presented earlier: 


1. Difficulty index p:

a. When power data only were compared, the p value of an
item was slightly higher when the item appeared early than
when it came late in the test, for the verbal forms. No similar
position effect was noted in the comparisons involving
only power data on the mathematics forms.

b. When speed and power data were compared, before drop-out
began, the power p’s usually exceeded the speed p’s for the
verbal forms. With high drop-out, the speed p’s exceeded
the power p’s for the verbal tests. No corresponding relationships
were found in the comparisons involving speed and
power data for the mathematics forms.

c. When speed data for one form were compared with speed
data for a second form in which the item arrangement was
different, for the verbal forms the p of an item was distinctly
higher when the item came late than when it came early.
No corresponding trends were noted for comparisons involving
speed data on two mathematics forms.


*Richardson states that only when the continuous distribution is normal is
the upper limit of the biserial coefficient unity, and that when the continuous
distribution is flatter than normal and the relationship is high, the computed
coefficient may be greater than one.


2. Difficulty index Δ:

a. When power data only were compared, the Δ value of an
item in the verbal forms was slightly lower when the item
appeared early than when it came late. No similar position
effect was noted in the comparisons involving only power
data on the mathematics forms.

b. When speed and power data were compared, for both the
verbal and the mathematics forms the speed Δ values generally
exceeded the power Δ values, with the difference between
the two values for an item becoming considerable when
the item appeared near the end of the test where drop-out
was very high. The correction factor in Δ was thus found
to be an over-correction.

c. When speed data for one form were compared with speed
data for a second form in which the item arrangement was
different, the Δ value of an item was found to be considerably
higher when the item came late than when it came early.
This was additional evidence that the correction factor involved
in Δ is an over-correction.


3. Item-test biserial coefficient of correlation:

a. When power data only were compared, the $r_{bis}$ of an item
was not found to vary noticeably with different placements
of the item in the test, for either the verbal or the mathematics
forms.

b. When speed and power data were compared, the speed
response-speed score biserials generally exceeded the power
response-power score biserials for items when drop-out on the
items was high (about half the cases or more).

c. When speed data for one form were compared with speed
data for a second form in which item arrangement was different,
the $r_{bis}$ of an item was distinctly higher when the item
appeared near the end of the test than when it came at the
beginning; and these sets of biserials showed much greater
variability about the main 45° diagonal line than that found
in previous comparisons.

d. These higher biserials for items appearing late in a speeded
test were found to be a reflection of the inappropriateness of
the biserial coefficient when the continuous-score distribution
for those attempting the item is markedly flatter than the
normal curve. The distribution curves for the scores of those
attempting a number of late items were examined, and found
to approach the rectangular shape.


4. Since a high value of p and a low value of Δ both are intended to
indicate that an item is easy, the findings on p and Δ reveal that
the correction involved in Δ, which was intended to overcome
the observed tendency of p’s to become higher as the group reaching
an item becomes more select through drop-out under speed,
actually is a considerable over-correction.*


Recommendations: Severe enough troubles are encountered with 
all of the three item indices (p, Δ, and $r_{bis}$) when drop-out is very
great (say over half the cases) to indicate that indices derived under 
such circumstances are of dubious value in test construction. Indeed, 
use of them may be quite misleading. The writer therefore recom- 
mends that these item indices not be relied upon, or in fact even com- 
puted, for items on which the drop-out has reached fifty per cent of 
the beginning group. In the setting of time limits for try-out forms, 
whenever it is highly desirable to secure useful information about 
the characteristics of every item, the test worker should allow ade- 
quate time for at least half of the group (preferably more) to at- 
tempt every item in the test. 
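
The fifty-per-cent rule can be applied mechanically before any indices are computed. The following is a minimal sketch; the function name `items_to_index`, the group size of 200, and the counts of examinees reaching each item are hypothetical values chosen for illustration.

```python
def items_to_index(n_reaching, group_size, max_dropout=0.5):
    """Return 1-based positions of items whose drop-out is below max_dropout."""
    keep = []
    for position, reached in enumerate(n_reaching, start=1):
        dropout = 1.0 - reached / group_size
        # An item whose drop-out has reached the limit is excluded.
        if dropout < max_dropout:
            keep.append(position)
    return keep


# Hypothetical numbers of examinees reaching each of six items (group of 200).
n_reaching = [200, 196, 170, 120, 100, 60]
usable = items_to_index(n_reaching, group_size=200)
print(usable)  # [1, 2, 3, 4]
```

Item 5, reached by exactly half of the group, is excluded because its drop-out has reached fifty per cent.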


*Attempts are under way at the Educational Testing Service to develop a
better means of correcting the difficulty index for the effects of drop-out.


REFERENCES

1. Brigham, C. C. A study of error. New York: College Entrance Examination
Board, 1932.

2. Conrad, H. S. Characteristics and uses of item-analysis data. Psychological
Monographs: General and Applied, 1948, No. 8 (Whole No. 295).

3. Davis, F. B. Item-analysis data. Harvard Education Papers, No. 2. Cambridge:
Graduate School of Education, Harvard University, 1946.

4. Kelley, T. L. Fundamentals of statistics. Cambridge: Harvard University
Press, 1947.

5. Kroll, A. Item validity as a factor in test validity. J. educ. Psychol., 1940,
33, 425-436.

6. Wesman, A. G. Effect of speed on item-test correlation coefficients. Educ.
psychol. Meas., 1949, 9, 51-57.


[Figures 1-8: scatter plots of item indices. Only the axis titles are legible.]

FIGURE 1. p: Verbal Form 2, Group 2, Power, against p: Verbal Form 1, Group 1, Power.
FIGURE 2. p: Math. Form 2, Group 3, Power, against p: Math. Form 1, Group 2, Power.
FIGURE 3. p: Verbal Form 1, Group 3, Speed, against p: Verbal Form 1, Group 1, Power.
FIGURE 4. p: Math. Form 1, Group 1, Speed, against p: Math. Form 1, Group 2, Power.
FIGURE 5. Δ: Verbal Form 1, Group 3, Speed, against Δ: Verbal Form 1, Group 1, Power.
FIGURE 6. Δ: Math. Form 1, Group 1, Speed, against Δ: Math. Form 1, Group 2, Power.
FIGURE 7. $r_{bis}$: Verbal Form 1, Group 1, Power-Power (other axis title illegible).
FIGURE 8. $r_{bis}$: Math. Form 2, Group 4, Speed-Speed, against $r_{bis}$: Math. Form 2, Group 3, Power-Power.


PSYCHOMETRIKA—VOL. 15, NO. 3 
SEPTEMBER, 1950 


A PROPOSED METHOD FOR ABSOLUTE RATIO SCALING* 


ANDREW L. COMREY 
THE UNIVERSITY OF ILLINOIS 


A method of ratio scaling is described for treating compara- 
tive judgments of paired stimuli. A method of comparative judg- 
ment developed by Metfessel is employed. Formulas for scale val- 
ues and the solution of a sample problem are provided. The method 
is designed to provide internal-consistency checks on the scale val- 
ues. Experimental interpretations of equal-unit and ratio properties 
of measurement scales are inherent in the procedure. 


A method of reporting comparative judgments has recently been 
proposed by Metfessel which is designed to yield ratio scales.† In
essence, the method consists of instructing judges to divide a total 
of 100 points among any specified number of samples being judged. 
The points are divided in such a manner as to reflect the absolute 
ratios existing among the samples. The method of scaling to be pro- 
posed here employs this type of judgment to build a ratio scale with 
many desirable features. 


I. Theoretical Formulation 

The fundamental method of presenting stimuli for comparative 
judgment is based upon the traditional paired-comparisons technique. 
That is, the stimuli are presented to the judges in all possible com- 
binations of two stimuli for each judgment, taking care to balance 
the order of presentation as is customary in paired-comparisons scal- 
ing. Instead of merely indicating which of the two stimuli presented 
is greater, or preferred, the judges divide 100 points between them 
in accordance with the absolute ratio of the greater to the lesser. 
Thus, if a judge assigns 60 points from the 100 points to Stimulus 
A, and 40 points to Stimulus B, he is representing Stimulus A as 
being in the ratio of 60 to 40, or 3:2, with Stimulus B. A separation 
of 75 to 25 would indicate a ratio of 3:1. 
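
The conversion from a division of points to a ratio is simple arithmetic, sketched below. The helper name `ratio_from_points` is hypothetical; exact fractions are used only so that the two examples from the text print without rounding.

```python
from fractions import Fraction


def ratio_from_points(points_first, total=100):
    """Ratio of the first stimulus to the second implied by a split of `total` points."""
    points_second = total - points_first
    if points_first <= 0 or points_second <= 0:
        raise ValueError("each stimulus must receive a positive number of points")
    return Fraction(points_first, points_second)


print(ratio_from_points(60))  # 3/2
print(ratio_from_points(75))  # 3
```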

The samples to be scaled are then arranged in rank order according
to the total number of points received in comparative judgments
with all the other samples, combining points for all judges
and all comparisons for every judge. For purposes of identification,
the number 1 is assigned to the sample receiving the highest number
of points, the number 2 is assigned to the sample receiving the
next highest number of points, and so on, down to the number $n$
for the last sample. Letting $y$ and $x$ represent sample identification
numbers, $p_{yx}$ denotes the average number of points from 100 that
has been given to sample $y$ in its comparisons with sample $x$. The
value $p_{xx} = p_{yy} = 50$, which means that any sample compared with
itself is assumed to receive 50 points from the 100 possible points.

*The author wishes to express his appreciation to Professor J. P. Guilford
for his comments and encouragement in connection with the preparation
[of this paper].

†Metfessel, Milton. A proposal for quantitative reporting of comparative
judgments. J. Psychol., 1947, 24, 229-235.

When the average number of points given to each sample in com- 
parison with each other remaining sample is known, the absolute 
ratios between adjacent samples in the rank-order series can be ob- 
tained from the following formula: 


$$R_{y(y+1)} = \frac{1}{n-1} \sum_{\substack{x=1 \\ x \neq y}}^{n} \frac{p_{yx}/p_{xy}}{p_{(y+1)x}/p_{x(y+1)}} , \tag{1}$$

where
$R_{y(y+1)}$ = The ratio of sample $y$ to the sample next following
it in the rank-order series.
$n$ = The number of samples in the rank-order series.
$p_{yx}$ = The average number of points from 100 given to
sample $y$ in its comparisons with sample $x$.
$p_{xy}$ = The average number of points from 100 given to
sample $x$ in its comparisons with sample $y$.


Once the values $R_{y(y+1)}$ are known, i.e., the ratios of samples in the
rank-order series to those immediately following them in the series,
the scale values for the samples can be obtained from the following
formula:

$$S_y = R_{(n-1)n} \times R_{(n-2)(n-1)} \times \cdots \times R_{y(y+1)} , \tag{2}$$

where
$S_y$ = The scale value for sample $y$ of the rank-order series.
Other symbols are the same as before.


The equation is not soluble for the highest numbered member
of the rank-order series. The scale value of 1.00 is arbitrarily
assigned to this sample, since it is the smallest or least preferred
sample of the series. Formula (2) provides scale values for the remaining
samples as multiples of 1.00, which is the scale value assigned
to the last member of the rank-order series.
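
To show how the two formulas combine, consider a hypothetical series of only three samples. The expansion below is a sketch that reads Formula (1) as the mean, over every sample $x$ other than $y$, of the ratio $p_{yx}/p_{xy}$ divided by the ratio $p_{(y+1)x}/p_{x(y+1)}$, with $p_{xx} = 50$; it is offered as an illustration under that reading rather than as part of the original derivation.

```latex
\begin{align*}
R_{12} &= \tfrac{1}{2}\left[\frac{p_{12}/p_{21}}{p_{22}/p_{22}}
          + \frac{p_{13}/p_{31}}{p_{23}/p_{32}}\right]
        = \tfrac{1}{2}\left[\frac{p_{12}}{p_{21}}
          + \frac{p_{13}\,p_{32}}{p_{31}\,p_{23}}\right], \\
R_{23} &= \tfrac{1}{2}\left[\frac{p_{21}/p_{12}}{p_{31}/p_{13}}
          + \frac{p_{23}/p_{32}}{p_{33}/p_{33}}\right]
        = \tfrac{1}{2}\left[\frac{p_{21}\,p_{13}}{p_{12}\,p_{31}}
          + \frac{p_{23}}{p_{32}}\right], \\
S_3 &= 1.00, \qquad S_2 = R_{23}, \qquad S_1 = R_{23}\,R_{12}.
\end{align*}
```

Each adjacent ratio thus averages the directly judged ratio with an indirect estimate obtained through the third sample.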


II. A Practical Example 

Forty-seven students in two elementary statistics courses at the 
University of Illinois served as subjects in an experimental check 
on the method. A black line was drawn on each of nine 12" × 12"
pieces of white cardboard. The lengths of the lines, as determined
by measurement after the experiment had been completed, were as
follows: A, 9.95"; B, 8.12"; C, 6.03"; D, 7.97"; E, 4.23"; F, 3.17";
G, 5.72"; H, 1.25"; and I, 5.38". The cards were exposed side by side
to the subjects for 15 seconds in every combination of 2 cards, ac- 
cording to a predetermined order. This order was such that each 
line appeared equally often as a left member and as a right mem- 
ber. Furthermore, no line occurred in two successive pairs and in 
general the individual lines were well spaced throughout the series. 
Lines were chosen as the sample medium because of the possibility
of checking the scale values obtained from judgments by actual
physical measurement.


TABLE 1
Averages of Points Assigned in Comparative Judgments

Line     1(A)   2(B)   3(D)   4(C)   5(G)   6(I)   7(E)   8(F)   9(H)
1(A)        -  43.47  44.91  34.81  36.09  34.53  28.74  25.21  11.00
2(B)    56.53      -  49.06  41.89  41.13  38.70  32.02  26.36  12.91
3(D)    55.09  50.94      -  42.77  41.04  37.40  34.51  25.79  12.79
4(C)    65.19  58.11  57.23      -  49.26  46.40  40.28  33.15  17.74
5(G)    63.91  58.87  58.96  50.74      -  48.47  41.34  34.40  16.00
6(I)    65.47  61.30  62.60  53.60  51.53      -  44.83  34.83  15.85
7(E)    71.26  67.98  65.49  59.72  58.66  55.17      -  43.57  21.06
8(F)    74.79  73.64  74.21  66.85  65.60  65.17  56.43      -  29.68
9(H)    89.00  87.09  87.21  82.26  84.00  84.15  78.94  70.32      -


TABLE 2
Ratios of Points Assigned in Comparative Judgments

Line     1(A)   2(B)   3(D)   4(C)   5(G)   6(I)   7(E)   8(F)   9(H)
1(A)        -    .77    .82    .53    .56    .53    .40    .34    .12
2(B)     1.30      -    .96    .72    .70    .63    .47    .36    .15
3(D)     1.23   1.04      -    .75    .70    .60    .53    .35    .15
4(C)     1.87   1.39   1.34      -    .97    .87    .67    .50    .22
5(G)     1.77   1.43   1.44   1.03      -    .94    .70    .52    .19
6(I)     1.90   1.58   1.67   1.16   1.06      -    .81    .53    .19
7(E)     2.48   2.12   1.90   1.48   1.42   1.23      -    .77    .27
8(F)     2.97   2.79   2.88   2.02   1.91   1.87   1.30      -    .42
9(H)     8.09   6.75   6.82   4.64   5.25   5.31   3.75   2.37      -


TABLE 3
Ratios of Adjacent Samples in the Rank-Order Series

                     1-2     2-3     3-4     4-5     5-6     6-7     7-8     8-9
Lines              (A-B)   (B-D)   (D-C)   (C-G)   (G-I)   (I-E)   (E-F)   (F-H)
1(A)                   -     .94    1.55     .95    1.06    1.32    1.18    2.82
2(B)                1.30       -    1.33    1.03    1.11    1.34    1.31    2.40
3(D)                1.18    1.04       -    1.07    1.17    1.13    1.51    2.33
4(C)                1.35    1.04    1.34       -    1.11    1.30    1.34    2.28
5(G)                1.24     .99    1.40    1.03       -    1.34    1.35    2.74
6(I)                1.20     .95    1.44    1.09    1.06       -    1.53    2.79
7(E)                1.17    1.11    1.28    1.04    1.15    1.23       -    2.85
8(F)                1.08     .97    1.42    1.06    1.02    1.44    1.30       -
9(H)                1.20     .99    1.47     .88     .99    1.42    1.58    2.37
Average Ratio       1.26    1.00    1.40    1.02    1.08    1.32    1.39    2.57
Measured Ratio      1.23    1.02    1.32    1.05    1.06    1.27    1.33    2.54


The computations leading to scale values are illustrated in 
Tables 1, 2, and 3. Table 1 gives the number of points from 100 
assigned to the sample lines listed by letter across the top of the 
table in their comparisons with the lines listed down the first col- 
umn. For example, Line B received 43.47 points when it was com- 
pared with Line A. Line A, of course, received 100 — 43.47 or 56.53 
points in the same comparison. These values represent means of 
points assigned by 47 subjects. 

Table 1 has been arranged in such a way that the lines are 
listed across the top in their order of size as judged by the number 
of points they received. Thus, A was the longest line, as judged, and 
H the shortest. In Table 2 are listed the ratios of the number of 
points assigned to the lines along the top row to the number of points 
assigned to the lines as listed in the first column. That is, in Table 
1, if the value at Row A, Column B, 43.47, is divided by the value 
at Row B, Column A, 56.53, the result is .77, which is recorded in 
Table 2 at Row A, Column B. Its reciprocal, 1.30, is listed at Row 
B, Column A. 

The next step is to find the ratio of the largest line’s length to 
that of the next smaller one, and so on. With these ratios, scale 
values can be constructed. Such ratios may be obtained by taking 
directly the entries in the cells relating A to B, B to D, and so on, 
in which the ratios are for a line to the next smaller line. Such a
procedure would waste much of the available data, however, and
might fail to give a set of ratios that are internally consistent. Sup- 
pose, for example, that we consider two ratios, A to D and B to D. 
If A/D = x and B/D = y, then A/B = x/y. This fact enables a 
computation of ratios between two line lengths, A and B, by virtue 
of the fact that ratios have been obtained experimentally for A and 
B in relation to other lines. So, it is possible to set up Table 3 which 
contains 8 values for each ratio between two adjacent lines in the 
rank-order series. 

The entries in Table 3 are obtained by dividing the values in 
Table 2 by those in the column immediately on the right. Thus, in 
Table 2, if from Row C 1.87 is divided by 1.39, the result is 1.35, 
which is a computed ratio of A to B, and is listed in Table 3 in Row 
C, Column A-B. The direct experimental ratios may be carried over 
to Table 3 without calculation. At the foot of Table 3 are listed the 
average-judgment ratios between successive lines from longest to 
shortest, H being the shortest line. Immediately under these are 
given the actual ratios determined by dividing the lengths of the re- 
spective lines as determined by physical measurement. 
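
The construction of Tables 2 and 3 can be sketched in a few lines of code. The block below enters only the lower triangle of Table 1 (the points given to the higher-ranked line of each pair) and derives the complementary entries as 100 minus those values. It then forms, for each adjacent pair in the rank order, the mean of the eight available estimates. The variable and function names are illustrative. Because no intermediate rounding is done here, the averages need not agree with the printed ones in every column.

```python
LINES = ["A", "B", "D", "C", "G", "I", "E", "F", "H"]  # rank order, longest first

# Lower triangle of Table 1: BELOW[i][j] is the mean number of points given to
# the higher-ranked line j when it was paired with line i (j < i).
BELOW = [
    [],
    [56.53],
    [55.09, 50.94],
    [65.19, 58.11, 57.23],
    [63.91, 58.87, 58.96, 50.74],
    [65.47, 61.30, 62.60, 53.60, 51.53],
    [71.26, 67.98, 65.49, 59.72, 58.66, 55.17],
    [74.79, 73.64, 74.21, 66.85, 65.60, 65.17, 56.43],
    [89.00, 87.09, 87.21, 82.26, 84.00, 84.15, 78.94, 70.32],
]

n = len(LINES)
# p[y][x]: mean points given to line y in its comparisons with line x.
# A line compared with itself is taken to receive 50 points.
p = [[50.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i):
        p[j][i] = BELOW[i][j]
        p[i][j] = 100.0 - BELOW[i][j]  # the two members of a pair share 100 points


def adjacent_ratio(y):
    """Mean of the n - 1 estimates of (line y) / (line y + 1), one per line x != y."""
    estimates = [
        (p[y][x] / p[x][y]) / (p[y + 1][x] / p[x][y + 1])
        for x in range(n)
        if x != y
    ]
    return sum(estimates) / len(estimates)


ratios = [adjacent_ratio(y) for y in range(n - 1)]
for y, r in enumerate(ratios):
    print(f"{LINES[y]}-{LINES[y + 1]}: {r:.2f}")
```

For the row of the smaller member of each pair the estimate reduces to the direct experimental ratio, which is why those ratios can be carried over without calculation.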

By assigning the number 1 to Line H, scale values may be computed
through successive multiplication. The ratio of F to H is 2.57,
hence the scale value of F is 2.57 × 1 or 2.57; the scale value for E
is 1.39 × 2.57 or 3.57, and so on. These values were: H, 1.00; F, 2.57;
E, 3.57; I, 4.71; G, 5.09; C, 5.19; D, 7.27; B, 7.27; and A, 9.16. By a
similar procedure, using ratios from physical measurements, the scale
values were: H, 1.00; F, 2.54; E, 3.38; I, 4.29; G, 4.55; C, 4.78; D, 6.31;
B, 6.44; and A, 7.92.
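
The successive multiplication can be reproduced as follows. This sketch takes the average judged ratios from the foot of Table 3 and computes the measured ratios from the reported line lengths. Rounding each product to two decimals before the next step is an assumption made here to mimic hand computation; the function name is illustrative.

```python
def scale_values(ratios_longest_first):
    """Successive multiplication upward from the shortest line (value 1.00).

    Each product is rounded to two decimals before the next multiplication,
    an assumption intended to mimic computation by hand.
    """
    values = [1.00]
    for ratio in reversed(ratios_longest_first):
        values.append(round(values[-1] * ratio, 2))
    return values  # shortest line first


# Average judged ratios from Table 3, in the order A-B, B-D, ..., F-H.
judged = [1.26, 1.00, 1.40, 1.02, 1.08, 1.32, 1.39, 2.57]

# Measured lengths in inches, in rank order A, B, D, C, G, I, E, F, H.
lengths = [9.95, 8.12, 7.97, 6.03, 5.72, 5.38, 4.23, 3.17, 1.25]
measured = [round(a / b, 2) for a, b in zip(lengths, lengths[1:])]

order = ["H", "F", "E", "I", "G", "C", "D", "B", "A"]
judged_scale = dict(zip(order, scale_values(judged)))
measured_scale = dict(zip(order, scale_values(measured)))
print(judged_scale)
print(measured_scale)
```

Because every scale value is a product of all the ratios below it, a small excess in several ratios accumulates toward the top of the series.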


III, Discussion 

A comparison of the ratios of lengths of lines obtained by using 
the present scaling method and by calculation with physical measure- 
ments (see Table 3) reveals a close agreement. None of the ratios 
obtained by scaling was in error by more than .08. The scale values
were shown to deviate more than this, however, because the judged
ratios were more often than not slightly larger than the actual ra- 
tios as calculated from measurements. This resulted in a progres- 
sive error as the successive multiplication of scale values by ratios 
was carried out. Whether this condition is a constant error generally 
characteristic of the method rather than a chance occurrence does 
not seem evident at this time. 

The effect of practice upon judgment variability is shown in 
Figure 1. The subjects had no previous experience with the method
of judgment except 3 or 4 practice comparisons during the instruc-
tions preceding the 36 experimental judgments. A consistent trend 
toward smaller standard deviations is noted with later trials. The 
standard deviations were identical for the distributions of judgments 
with any one pair of lines. That is, when Line A was compared with 
B, the standard deviation of the distribution of points given A was 
the same as that for B, since the total was 100 points for both. 

It is evident from Figure 1 that at least one other factor was 
influencing the judgment variability, since frequent fluctuations in 
the standard deviations occurred. Figure 2 shows a scatter plot of 
the relation between distribution variability and the deviation of 
average-judgment values from 50.00. A consistent curvilinear trend 
can be seen here. As two samples approach equality and extreme in- 
equality, the judgments become more consistent among judges. It is 
interesting to note, however, that this increased variability of judg- 
ments for samples differing moderately in size had no appreciable 
effect on the accuracy of ratios obtained for such samples. 


FIGURE 1
Judgment Variability as a Function of Number of Trials. (Trials are shown
along the X-axis and standard deviation of points assigned in comparative
judgments along the Y-axis.)


FIGURE 2
Relationship Between Differences in Average Number of Points Assigned
to Each Member of a Sample Pair (X-axis) and Standard Deviation of
Corresponding Trial (Y-axis) as a Function of Trial Number (Numbers
Adjacent to Points).


Asymmetry and transitivity as requirements of order are clear- 
ly characteristic of this type of scale. If A is greater than B, it is im- 
possible for B to be judged greater than A, at the same time. Fur- 
thermore, if A receives more points in judgments than B and B re- 
ceives more than C, then A is greater than B and B is greater than 
C; it is then necessary that A be greater than C by this criterion. 
The symmetrical aspect of equality is evidently satisfied, since if 
A=B, A receives 50 points in its comparisons with B. Since A plus 
B must total 100 points, it is certainly impossible for B not to equal A. 
The transitive aspect of the equality state is not so clearly satisfied, 
at the level of individual judgments. However, at the stage of final 
ratios and scale values, which is the important point, transitivity will 
definitely be characteristic of the equality state. That is, if A = B, 
then the ratio of A to B is, by definition, 1.00. If B = C, the ratio 
of B to C is 1.00. When scale values are computed, A, B, and C will
all have the same scale value, and A will equal C. It is possible
for a reversal to occur at the level of individual judgment, but not 
in the end result. An examination of the empirical data shows that 
such comparisons as are implied by the judged values with respect 
to the relations “greater,” “less,” and “equals,” are verified by the
physical measurements, within the limits of a small margin of error. 
Some margin of error is always present in measurement, of what- 
ever kind. 

Thus far, it has been shown that at least an ordinal scale is de- 
rived by using this method. Further than this, however, the scale 
definitely has some of the characteristics commonly associated with 
ratio scales. Ratio judgments are implicit in the procedure, obvious- 
ly. It but remains to examine the empirical data to determine wheth- 
er such judgments are in fact obtained. A comparison of the judged 
and measured ratios certainly confirms the possibility of obtaining 
accurate ratios with at least one sample medium, for adjacent mem- 
bers of the rank-order series. 

Providing the present slight constant error of estimation proves 
to be a chance occurrence, or providing a correction is applied to 
eliminate such error, ratio-scale characteristics, including equal units, 
will be established, within the limits of experimental error. Such er- 
ror will unquestionably vary for different sample media and judges, 
but this is a matter of practical rather than theoretical concern. 
Since the scale inherently possesses ratio properties, the task becomes 
one of ferreting out and minimizing the sources of error. In the 
present case, the error is certainly not large. At worst, the ratios 
between adjacent samples in the rank-order series must be consid- 
ered reasonably dependable from the evidence obtained in the pres- 
ent application. 

A possible application for the method of considerable interest is 
in the area of personnel evaluation. Where careful and accurate 
quantitative evaluations are needed for criteria, this procedure has 
many advantages. A set of quantitative scale values can be obtained 
which approach the theoretical unit requirements for product- 
moment correlations and other statistical treatments. Furthermore, 
the numerical values represent an internally consistent system for 
the group of judges employed. The amount of labor involved need 
not be excessive, since the personnel to be evaluated can be broken up 
into smaller groups with one member common to any two groups. 
This common member can serve as a reference point for locating two 
sets of ratios on the same scale. In general, the method should be
useful for all uni-polar continua where the stimuli can be presented
by pairs. 


Manuscript received 5/19/49. 
Revised manuscript received 12/5/49. 














PSYCHOMETRIKA—VOL. 15, NO. 3 
SEPTEMBER, 1950 


BOOK REVIEWS 


PALMER O. JOHNSON. Statistical Methods in Research. New York: Pren- 
tice-Hall, Inc., 1949. Pp. xviii + 377. $7.65. 

Planned primarily for graduate students majoring in education and psy- 
chology and based upon the content of a one-year course in advanced statistics, 
Palmer Johnson’s Statistical Methods in Research is intended to furnish stu- 
dents an up-to-date treatment of statistical theory and practice. The author has 
achieved a favorable balance between mathematical analysis on the one hand 
and practical applications on the other, although many instructors are likely to 
believe that he has overemphasized the mathematical aspects of statistical anal- 
ysis in educational and psychological research. However, a thoroughgoing, rig- 
orous, and advanced text in the quantitative methods of education and psy- 
chology that stresses the recent developments in statistical analysis has been 
long overdue. Johnson has succeeded admirably in writing such a text. 

It is assumed by the author that those students using his book have had an 
introductory course in descriptive statistics. He has suggested that even the 
student without calculus should be able to profit considerably from the non- 
technical and logical treatment in many of the sections. Although the author’s 
impression may be correct, the reviewer would strongly urge that no one should 
attempt to study seriously a large portion of the material in the text unless he 
has had about the equivalent of three years of undergraduate mathematics in- 
cluding courses in college algebra, trigonometry, analytics, differential and in- 
tegral calculus (three semesters), and introductory mathematical statistics. The 
succinct manner in which the author writes — to say nothing of the lavish use 
of summational notation and subscripts that are sometimes highly abbreviated 
in what they designate — places a premium upon the mathematical maturity 
of the reader. Frequently the reader has to fill in gaps to assure himself of an 
adequate understanding of the development of a given topic. 

One or more chapters are devoted to each of the following topics: prob- 
ability and likelihood, sampling distributions, procedures in testing statistical 
hypotheses, estimation of population parameters, theory and practice of sam- 
pling, the analysis of variance and covariance, principles and techniques of ex- 
perimentation, and multiple regression procedures including the discriminant 
function. One whole chapter is given to normal and normalized distributions; 
and another, to procedures used in the analysis of data not fulfilling the as- 
sumptions of normality. Particularly methodical is Johnson’s approach in build- 
ing an understanding of the nature of statistical inference. 

The chapters in which procedures and applications are emphasized assume 
the appearance of well known handbooks in the physical sciences and engineer- 
ing. For one who has read through the text once and has gained a reasonable 
familiarity with it, such a format is an excellent one. Quick reference to model 
examples, each of which illustrates one type of problem involving specific (and 
clearly stated) assumptions, is thus possible. Incidentally it may be said that 
the use of non-fictitious or “live” data in the illustrative examples is a
commendable feature of the text and a welcome change from the practice
of so many text book writers.

Other noteworthy points include numerous references to primary source 
material at the end of each chapter, problems for students based upon real data, 
a table of the distribution of t for probability values other than .05 and .01, one 
of Nayer’s tables for testing the significance of differences in the variances 
among several samples of the same size, a comprehensive table of contents, and 
a detailed index. 

Although some consideration is given to the estimation of test reliability, 
the fact that Johnson does not cover several important aspects of test theory and 
scaling and does not treat factor analysis might be considered a minor limita- 
tion to the usefulness of the text in graduate courses in which these topics 
are studied. However, the amount of material that has been presented is more 
than enough for a two-semester graduate course in psychology and education. 

One fundamental weakness appears to be a failure on the part of the author
to discuss at length the major limitations of the analysis-of-variance methods and 
of the experimental designs to which these methods are applied. Of course, this 
shortcoming may reflect the existence of a systematic point of view, the result 
of which is that the usefulness and general superiority of the factorial design 
in educational research over the single variable design are stressed at the ex- 
pense of mention of the inadequacies of this type of comprehensive experimental 
design. 

Although the book may not be widely used as a text by a majority of in- 
structors in education and psychology because of its inherent difficulty, it will 
serve as a valuable reference source for the few gifted students and research
workers with adequate mathematical training who can comprehend its coverage.
Statistical Methods in Research will probably be one of the classics in the field 
of educational and psychological statistics for many years to come. 


Princeton University William B. Michael 


J. P. GUILFORD AND WILLIAM B. MICHAEL. The Prediction of Categories
from Measurements: With Applications to Personnel Selection and Clinical 
Prognosis. Beverly Hills: Sheridan Supply Company, 1949. Pp. 55. 


This small booklet provides a clear, explicit, and comprehensive discussion 
of the problems and possible solutions involved in predicting a membership in 
dichotomous categories and in setting critical scores for prediction purposes. 

The content is principally concerned with possible answers to the two ques- 
tions posed early in the discussion, namely: 

(1) What is the probability that an individual with a known score on the 
continuous independent variable will be in a designated category of the (dichoto- 
mous) dependent variable? 

(2) Given the probability of an individual’s being in a designated cate- 
gory of the (dichotomous) dependent variable, what should be the critical, or 
cut-off, score on the continuous independent variable? 

Solutions are considered separately for the case in which the dependent 
variable is an artificial dichotomy and in the case in which the dichotomy is a 
true one, although some of the proposed solutions are presented as being applic- 
able in both instances. 






















Graphic procedures for determining probability of belonging in a given cate- 
gory are presented and illustrated by computational examples. Procedures for 
computing this probability from tables of the normal bivariate surface are also 
given for possible application when the dichotomous criterion variable may be 
legitimately assumed to be continuous and normally distributed. This, together 
with brief consideration of the proper means for optimal combination of vari- 
ables in predicting a dichotomous criterion, comprises the major share of the 
discussion devoted to the first of the above questions. The authors appear to
favor the multiple correlation solution proposed by Wherry over the more widely
publicized discriminant function for multiple prediction of a dichotomy. 

In considering possible answers to the second question, techniques are pre- 
sented for determining the critical score that will yield any desired probability 
of success (or membership in a category). This in a sense gives a general solu- 
tion to the critical score problem in that it allows choice of the proportion of 
failure that can be tolerated. The answer to the problem of the optimal critical 
score is sought in Guttman’s principle of equal likelihood. Considerable space 
is devoted to demonstrating the adequacy of this principle, and to a somewhat 
elaborate consideration of the manner in which critical scores satisfying this 
principle vary with variation of such parameters as the percentage in a given 
category of the dichotomy, the difference between the mean predictor scores in 
the two categories, etc. 
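
The dependence of such a critical score on these parameters can be sketched under a simplifying assumption that is not drawn from the booklet: predictor scores are normally distributed within each category with a common standard deviation. On that assumption, the score at which membership in either category is equally likely has a closed form. The function and parameter names below are hypothetical.

```python
from math import log


def equal_likelihood_cutoff(mean_low, mean_high, sd, p_high):
    """Score at which membership in either category is equally likely.

    Assumes (for illustration only) normal predictor scores with a common
    standard deviation `sd` in both categories. `p_high` is the proportion
    of all cases belonging to the higher-scoring category.
    """
    if mean_high <= mean_low:
        raise ValueError("mean_high must exceed mean_low")
    if sd <= 0:
        raise ValueError("sd must be positive")
    if not 0 < p_high < 1:
        raise ValueError("p_high must lie strictly between 0 and 1")
    midpoint = (mean_low + mean_high) / 2
    # Solve p_high * f_high(x) = (1 - p_high) * f_low(x) for x, where
    # f_high and f_low are the normal densities of the two categories.
    shift = sd ** 2 * log((1 - p_high) / p_high) / (mean_high - mean_low)
    return midpoint + shift
```

With equal proportions the cutoff sits midway between the two means. Raising the proportion in the higher-scoring category lowers the cutoff, and a larger difference between the means draws it back toward the midpoint.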

This reviewer regards the solution deriving from the principle of equal 
likelihood as inadequate, and questions, consequently, the rather considerable 
emphasis and favorable treatment accorded it. Guilford and Michael, appar- 
ently, base their support of this solution upon their demonstration that critical 
scores satisfying the principle of equal likelihood minimize errors; that is,
minimize the total of the predicted “failures” who would have succeeded and the
predicted “successes” who would have failed. The reader will note the implied 
assumption that both types of errors are of equal importance. This assump- 
tion appears totally unwarranted to this reviewer. Generally speaking, the 
simpler solution of setting cutting scores as a function of the ratio of number 
of vacancies to number of applicants would seem much more satisfactory. It 
would seem evident that few hiring agencies would choose to disregard the 
benefits deriving from an advantageous selection ratio in the interests of re- 
ducing the proportion of rejected individuals who would have been successful. 
Critical scores based on the principle of equal likelihood would often lead to 
such a choice. 
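
The alternative of setting the cutting score from the ratio of vacancies to applicants can be sketched as follows. This is only one plausible reading of that rule, with hypothetical names: applicants are ranked on the predictor and the cutoff is the score of the last applicant needed to fill the vacancies.

```python
def selection_ratio_cutoff(scores, vacancies):
    """Lowest score admitted when the top `vacancies` applicants are taken.

    `scores` holds one predictor score per applicant. If several applicants
    tie at the cutoff, more than `vacancies` of them may meet it.
    """
    if not 0 < vacancies <= len(scores):
        raise ValueError("vacancies must be between 1 and the number of applicants")
    ranked = sorted(scores, reverse=True)  # highest score first
    return ranked[vacancies - 1]


# Illustrative data only: ten applicants, three vacancies.
applicant_scores = [55, 72, 61, 90, 48, 66, 83, 59, 77, 70]
cutoff = selection_ratio_cutoff(applicant_scores, 3)
```

A cutoff found this way depends only on the applicant scores and the number of vacancies, and makes no use of the relative frequency of the two kinds of error.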

In general, however, the presentation is excellent. A number of charts and 
nomographs are presented which will be found useful to those with problems 
in the areas discussed by the authors. 

Department of the Army Hubert E. Brogden 


BOOKS RECEIVED 


WILLIAM G. COCHRAN AND GERTRUDE M. COX. Experimental Designs. New York:
John Wiley & Sons, 1950.

J. P. GUILFORD. Fundamental Statistics in Psychology and Education. New
York: McGraw-Hill Book Co., 1950.

DONALD R. TAFT. Criminology. New York: The MacMillan Co., 1950.













